AD-A276  506 


A002/Code  W;  A003/Documenting;  A004/Final  Report  •  February  23,  1994 


PROOF  OF  CONCEPT  FOR  THE  REWRITE  RULE  MACHINE: 
INTERENSEMBLE  STUDIES 


José  Meseguer,  Principal  Scientist 
Computer  Science  Laboratory 

SRI  Project  ECU  1505 


Prepared  for: 

Chief  of  Naval  Research 
800  North  Quincy  Street 
Arlington,  VA  22217-5660 




Attn:  Dr.  Keith  Bromley,  Scientific  Officer 
cc:  Mary  Ann  Cook,  Contracting  Officer 

Director,  Naval  Research  Laboratory,  Attn:  Code  2627 
Administrative  Contracting  Officer 
Defense  Technical  Information  Center 


Contract  No.  N00014-90-C-02W 


Approved: 

Mark  Moriconi,  Director 
Computer  Science  Laboratory 


Donald  Nielson,  Vice  President 
Computing  and  Engineering  Sciences  Division 


Public distribution is unlimited.


94-07081 

Final  Report  for  the  Project 
“Proof  of  Concept  for  the 
Rewrite  Rule  Machine: 
Interensemble  Studies” 

Patrick  Lincoln,  Jose  Meseguer,  Babak  Taheri,  and  Timothy  Winkler 
SRI  International,  Menlo  Park,  CA  94025 


1  Introduction 

Under  the  direction  of  Dr.  Jose  Meseguer,  the  Rewrite  Rule  Machine  (RRM)  team 
began  this  project  on  15  August  1990,  and  completed  the  work  on  14  August  1991.  In 
addition  to  the  project  participants  listed  as  authors,  Prof.  Joseph  Goguen  of  Oxford 
University  served  as  a  consultant. 

The main goal was to learn through simulation about the functionality and
performance on realistic applications of an RRM system consisting of a collection of
RRM ensemble chips (each such chip being a SIMD processor) connected on a
network, and to design mechanisms to support the simultaneous parallel computation of
applications across many such ensemble chips.

To achieve these goals we first built a high-level interensemble simulator and ran
a collection of benchmarks on it under varying assumptions about several
architectural parameters. This gave a first estimate of the communication requirements
of the RRM and let us assess the feasibility of those requirements in view of existing
network technology. We then combined a second, very detailed register-transfer level
simulator of a single ensemble, the performance results of a collection of applications
run on it, and modeling above the ensemble level to obtain more detailed and accurate
interensemble performance estimates. Mechanisms supporting parallel computations
across many ensembles were also studied and designed.

Section 2 of this report gives an introductory overview of the RRM. Section 3
describes the new RRM architecture on which the detailed ensemble simulator and
the modeling of higher levels were based. Section 4 describes the interensemble
computation mechanisms. Section 5 discusses simulation and performance estimation:
interensemble simulations are discussed in Section 5.1, communication requirements
in Section 5.2, and ensemble simulations and performance modeling of higher RRM
levels based on them in Section 5.3. The code of the interensemble simulator and of
several benchmarks run on it is given in Appendix A. A description of the ensemble
simulator is given in Appendix B, and its code and that of the benchmarks run on it are
given in Appendix C.

2  Overview  of  the  RRM 

Following an overview of the Rewrite Rule Machine (RRM) architecture and model
of computation with special emphasis on the new ensemble design, we discuss
performance estimates based on simulation. The architecture is a multilevel hierarchy,
which is SIMD at the lower (chip) levels, and MIMD at the higher levels. This
enables the RRM to combine the advantages of the SIMD and MIMD approaches. The
RRM model of computation is concurrent graph rewriting, which supports extremely
fine-grain parallelism, dynamic resource allocation, and simple semantics.

Since  performance  estimation  for  a  machine  like  the  RRM  is  difficult,  we  must 
carefully  justify  our  approach.  We  discuss  the  problems  and  how  we  address  them 
later  in  this  document.  Our  approach  to  performance  estimation  may  be  summarized 
as  follows:  we  chose  a  diversity  of  problems  to  stress  the  design  in  different  ways, 
including communication, memory, and computation; we chose problems
representative of different application areas; and we built and used different simulators to get a
variety  of  performance  estimates. 

2.1  Multigrain  Concurrency  and  Applications 

Many important real-life applications involve a number of diverse, relatively
independent processes, many of which are computationally homogeneous. For example, a
large  simulation  problem  may  involve  many  independent,  loosely  coupled  processes. 

Let us call a computation homogeneous if at each moment it consists of many
instances of the same instruction being applied to many data items in parallel;
this is sometimes called data parallelism. While many familiar numerical algorithms have
this form, many complex computational tasks are locally homogeneous but globally
inhomogeneous.

Because of its very fine-grain SIMD parallelism at the chip level combined with
its flexible coarser-grain MIMD parallelism at the network level that allows different
chips to work on very different subtasks of the same problem at once, the RRM can
exploit a problem's parallelism at several levels. We call this property multigrain
concurrency; it makes the RRM very well suited for solving not only homogeneous
problems, but also complex, locally homogeneous but globally inhomogeneous
problems in many areas, including discrete event simulation, decision support systems,
rapid prototyping, vision, computational geometry, automated deduction, finite
element methods, neural nets, and hardware simulation.

Figure 1: Concurrent Rewriting of Fibonacci Expressions

2.2  Combining  SIMD  and  MIMD 

At present, the two main approaches to massive parallelism are SIMD machines and
MIMD multicomputers. Examples of the state of the art in each category are the
Connection Machine, CM-2 (Thinking Machines Inc. [14, 4]), and the MP1216 (MasPar
Computer Corporation [23]), for SIMD computers; and Mosaic (Chuck Seitz, Caltech
[22]), the J-machine (William Dally, MIT [5]), Paragon (Intel Corporation), and the
CM-5 (Thinking Machines, which simulates SIMD by MIMD broadcast), for MIMD
computers. These two approaches are quite different. Each has unique advantages
not shared by the other approach. The strength of SIMD machines is their
exploitation of fine-grain data parallelism, which makes them a good choice for homogeneous
problems; their weakness is their centralized control, executing the same code
everywhere, which makes them perform poorly on large nonhomogeneous applications.
MIMD machines are much more flexible because they allow different code to be run
in different processors simultaneously; however, their communication (typically
asynchronous interprocessor message passing over a network) is not well suited to data
parallelism.

A key goal of the RRM is to combine the best of these two approaches in a
single architectural design. It shares with SIMD machines the capability for
fine-grain data parallelism, which is carried to an even finer level in the RRM ensemble;
however, because of its decentralized MIMD control, the RRM can perform well on
both homogeneous and nonhomogeneous problems, whereas SIMD machines can excel
only on homogeneous problems. Compared with MIMD machines, the RRM enjoys
the same flexibility and generality, based on distributed control and asynchronous
message passing, but because the RRM is SIMD at the chip level, it can exploit
fine-grain data parallelism locally, even for highly nonhomogeneous applications, whereas,
at present, purely MIMD machines can get large degrees of parallelism only at the
interprocessor level.


2.3  Programmability 

The  RRM  is  programmable  in  a  wide  variety  of  declarative  ultra-high-level  languages 
that permit massive exploitation of implicit parallelism and ease the creation and
porting  of  parallel  programs.  We  believe  that  declarative  languages  are  good  choices 
for  programming  such  applications  as  vision,  real-time  plant  control,  simulations, 
and  expert  systems,  because  they  do  not  require  explicit  commitment  to  specific 
forms  of  synchronization  or  scheduling.  These  convictions  are  supported  by  extensive 
simulations,  and  by  compilation  techniques  [12,  1,  20]  making  functional  (e.g.,  OBJ 
[8]),  object-oriented  (e.g.,  Maude  [17],  FOOPS  [11]),  and  relational  (e.g.,  Eqlog  [10]) 
programming  languages  easy  to  compile  into  RRM  code. 

However,  it  is  a  fact  of  life  that  some  parts  of  large  applications  programs  have 
already  been  written,  and  it  may  not  be  practical  to  rewrite  them  in  a  declarative 
language.  Because  its  flexible  model  of  computation  also  supports  imperative  features, 
a  compiler  for  the  RRM  from  a  conventional  language,  even  a  sequential  one,  could 
be  written  relatively  straightforwardly. 

2.4  The  Concurrent  Rewriting  Model  of  Computation 

The  RRM’s  model  of  computation  is  concurrent  rewriting.  In  this  model,  data 
are  terms  constructed  from  a  given  set  of  constant  and  function  symbols,  and  a 
program  is  a  set  of  equations  that  are  interpreted  as  left  to  right  rewrite  rules.  The 
lefthand  side  (abbreviated  LHS)  and  righthand  side  (RHS)  of  a  rewrite  rule  may  have 
variables  as  well  as  function  symbols.  A  variable  can  be  instantiated  with  any  term  of 
the  appropriate  sort,  and  a  set  of  instantiations  for  variables  is  called  a  substitution. 

A  rewriting  computation  starts  with  a  given  term  as  its  data  and  a  given  set 
of  rewrite  rules  as  its  program.  Applying  a  rewrite  rule  has  two  phases,  called 
matching  and  replacement.  The  matching  phase  attempts  to  find  a  substitution 
that  yields  a  subterm  of  the  input  term  when  applied  to  the  rewrite  rule’s  lefthand 
side.  Then,  in  the  replacement  phase,  the  matched  subterm,  called  the  redex,  is 
replaced  by  the  righthand  side  of  the  rule,  instantiated  with  the  same  substitution. 
Rules  are  applied  until  no  more  matches  can  be  found;  then  the  resulting  term  is 
called  reduced  and  considered  to  be  the  final  result. 
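
The two phases can be illustrated with a minimal Python sketch. It is not the RRM implementation: the tuple encoding of terms, the `Var` class, the leftmost-outermost search order, and the Peano addition rules used as a test program are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A rule variable (hypothetical encoding, not from the report)."""
    name: str


def match(pattern, term, subst=None):
    """Matching phase: return a substitution s with pattern[s] == term, or None."""
    subst = {} if subst is None else subst
    if isinstance(pattern, Var):
        bound = subst.get(pattern.name)
        if bound is None:
            return {**subst, pattern.name: term}
        return subst if bound == term else None
    if isinstance(pattern, tuple):
        # Function application: same symbol and same number of arguments.
        if not (isinstance(term, tuple) and len(term) == len(pattern)
                and term[0] == pattern[0]):
            return None
        for sub_pattern, sub_term in zip(pattern[1:], term[1:]):
            subst = match(sub_pattern, sub_term, subst)
            if subst is None:
                return None
        return subst
    return subst if pattern == term else None  # constant symbol


def instantiate(term, subst):
    """Replacement phase helper: apply a substitution to a rule's righthand side."""
    if isinstance(term, Var):
        return subst[term.name]
    if isinstance(term, tuple):
        return (term[0],) + tuple(instantiate(arg, subst) for arg in term[1:])
    return term


def rewrite_once(term, rules):
    """Rewrite one redex (outermost, leftmost); return None if term is reduced."""
    for lhs, rhs in rules:
        subst = match(lhs, term)
        if subst is not None:
            return instantiate(rhs, subst)
    if isinstance(term, tuple):
        for i in range(1, len(term)):
            new_arg = rewrite_once(term[i], rules)
            if new_arg is not None:
                return term[:i] + (new_arg,) + term[i + 1:]
    return None


def normalize(term, rules):
    """Apply rules until no more matches can be found."""
    while True:
        new_term = rewrite_once(term, rules)
        if new_term is None:
            return term
        term = new_term


X, Y = Var("X"), Var("Y")
# add(0, Y) = Y ; add(s(X), Y) = s(add(X, Y))
PEANO_RULES = [
    (("add", "0", Y), Y),
    (("add", ("s", X), Y), ("s", ("add", X, Y))),
]
two = ("s", ("s", "0"))
one = ("s", "0")
three = normalize(("add", two, one), PEANO_RULES)
```

This sketch rewrites one redex at a time; the concurrent model described next differs in that many redexes are rewritten in the same step.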

Figure 2: Hierarchical Structure of the RRM

In the concurrent rewriting model of computation, more than one rule can be
applied at once, and each rule can be applied to many subterms of the given term
at once. Let us explain this by example. Here is a simple program to compute the
Fibonacci numbers:

(1)  fibo(0) = 0

(2)  fibo(1) = 1

(3)  fibo(N) = fibo(N-2) + fibo(N-1)    if N > 1

If you give fibo(3) as data, the top node will match rule (3); thus the whole term
will be replaced by

fibo(1) + fibo(2)

In the next step, the first fibo node will match rule (2), and the second fibo will
match rule (3) again, and the simultaneous application of these rules yields

1 + (fibo(0) + fibo(1))

in  just  one  step  of  concurrent  rewriting.  Figure  1  illustrates  these  two  concurrent 
rewriting  steps,  using  tree  representation  for  expressions. 
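
The two steps above can be reproduced with a small Python sketch. The tuple encoding of terms and the function names are assumptions made for illustration; the sketch applies only rules (1) to (3) and leaves the integer additions unevaluated, since the program above gives no rule for them.

```python
def concurrent_step(term):
    """Apply rules (1)-(3) at every fibo node of the term in one step.

    Terms are ints, ("fibo", n), or ("+", left, right). Nodes created by
    rule (3) during this step are not rewritten again until the next step.
    """
    if isinstance(term, int):
        return term
    if term[0] == "fibo":
        n = term[1]
        if n == 0:
            return 0                                    # rule (1)
        if n == 1:
            return 1                                    # rule (2)
        return ("+", ("fibo", n - 2), ("fibo", n - 1))  # rule (3), N > 1
    _, left, right = term
    return ("+", concurrent_step(left), concurrent_step(right))


def contains_fibo(term):
    """True if any fibo node remains in the term."""
    if not isinstance(term, tuple):
        return False
    return term[0] == "fibo" or any(contains_fibo(arg) for arg in term[1:])


def rewrite_until_fibo_free(term):
    """Count the concurrent steps needed to eliminate every fibo node."""
    steps = 0
    while contains_fibo(term):
        term = concurrent_step(term)
        steps += 1
    return steps, term


first = concurrent_step(("fibo", 3))   # fibo(1) + fibo(2)
second = concurrent_step(first)        # 1 + (fibo(0) + fibo(1))
```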

We  say  that  a  concurrent  rewriting  computation  is  SIMD,  when  just  one  rewrite 
rule  is  applied  concurrently  at  each  moment;  in  the  RRM,  this  style  of  concurrent 
rewriting  is  realized  by  an  ensemble  chip,  as  explained  later.  If  several  rules  are 
concurrently being applied, each to possibly many instances, we have MIMD
concurrent rewriting; this general case is the correct model for the RRM as a whole. See
[9]  for  general  background  on  the  concurrent  rewriting  model,  [6]  for  definitions  of 
SIMD  and  MIMD  rewriting  (called  parallel  and  concurrent  rewriting  in  that  paper), 
and  [18,  19,  17]  for  a  definition  of  concurrent  rewriting  as  deduction  in  rewriting  logic 
and  a  systematic  treatment  of  concurrent  object-oriented  computation  by  means  of 
concurrent  rewriting. 

Two additional topics treated in [9] deserve mention. The first is sharing, which
permits a common substructure of two or more given structures to be shared between
them, rather than requiring that it be duplicated. This leads to directed acyclic
graphs rather than just trees. The second topic is evaluation strategies, which are
annotations that impose restrictions on concurrent execution in order to improve the
performance of parallel computations. A strategy for a function f of n arguments
consists of a set {i_1, ..., i_k} ⊆ {1, ..., n} indicating the argument places that should
be already reduced before a rewrite rule match for f is attempted. For example,
if_then_else_fi is typically computed with strategy {1}, and integer addition with
the "bottom-up" strategy {1,2}.

3  The  RRM  Architecture 

The  RRM  architecture  is  hierarchical,  with  each  unit  consisting  of  a  collection  of 
cooperating  units  at  the  next  lower  level.  The  most  basic  processing  element  is  the 
cell,  with  four  cells  making  up  a  tile.  An  ensemble  chip  contains  hundreds  of  cells 
(576  is  our  current  estimate).  A  cluster  is  a  collection  of  ensemble  chips  connected 
on  a  board,  and  the  machine  as  a  whole  is  a  network.  Figure  2  provides  a  pictorial 
representation  of  the  RRM  hierarchy. 

A  single  ensemble  yields  very  fast,  extremely  fine-grain  SIMD  rewriting,  but  RRM 
execution  is  coarse-grain  MIMD  at  the  cluster  and  network  levels,  since  each  ensemble 
independently  executes  its  own  rewrites  on  its  own  data,  communicating  with  other 
ensembles  when  necessary. 

3.1  Cell,  Tile,  and  Ensemble  Architecture 

The  most  basic  computational  element  in  the  RRM  is  the  cell  [16,  2],  which  stores 
one  data  item  with  pointers  to  other  cells,  and  also  provides  basic  computational  and 
communication  capabilities;  thus  cells  mix  storage,  computation,  and  communication. 
A  cell  consists  of: 

•  Several registers (mostly 16-bit), including:

-  token, which encodes the operation or constant symbol of a data node,

-  left and right, which point to the descendant nodes (unary operations
use only left, and n-ary operations for n > 2 are decomposed into binary
ones),

-  a 32-bit marks register, which holds volatile information (similar to
condition codes),

-  flags, which holds less volatile information, such as type and reduction
status,

-  twelve general-purpose registers, including ntoken, nleft, nright, and
nflags.

•  An ALU to operate on and test the contents of registers.

•  Interfaces to communication channels and the controller.

We  divide  the  silicon  area  of  the  ensemble  chip  into  a  12  x  12  mesh  of  tiles, 
each  with  four  cells.  Adjacent  tiles  are  directly  connected  by  short  wires,  so  that 
placing  logically  linked  nodes  in  cells  located  in  adjacent  tiles  permits  very  efficient 
communication.  Placing  several  cells  in  one  tile  increases  the  probability  of  logically 
related  data  being  in  adjacent  cells. 
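
The geometry just described can be checked with a short Python sketch. The coordinate scheme and function names are hypothetical, and the sketch assumes that tiles on the edge of the mesh simply have fewer neighbors (no wrap-around), which the text does not state.

```python
TILES_PER_SIDE = 12   # 12 x 12 mesh of tiles per ensemble chip
CELLS_PER_TILE = 4

TOTAL_CELLS = TILES_PER_SIDE * TILES_PER_SIDE * CELLS_PER_TILE


def adjacent_tiles(row, col):
    """North, south, west, and east neighbor tiles that lie inside the mesh."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates
            if 0 <= r < TILES_PER_SIDE and 0 <= c < TILES_PER_SIDE]


def adjacent_cells(row, col):
    """Cells reachable over short wires from any cell of tile (row, col).

    Each cell is identified as (tile row, tile column, index within tile).
    """
    return [(r, c, k)
            for r, c in adjacent_tiles(row, col)
            for k in range(CELLS_PER_TILE)]
```

`TOTAL_CELLS` evaluates to 576, which agrees with the per-chip cell estimate given at the start of Section 3, and an interior tile yields 4 x 4 = 16 adjacent cells.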

Our new ensemble design is simpler and has substantially better overall
performance than previous designs [7, 2]. Its simpler instructions allow a faster clock (100
MHz seems a reasonable estimate) and provide much better support for
communication between cells.

An  ensemble  has  a  single  SIMD  controller  that  broadcasts  its  instructions  to  all 
cells.  The  controller  can  obtain  very  fast  feedback  (one  clock  cycle)  about  the  state 
of  the  cells  (such  as  type  of  data  and  operation  symbols  in  cells,  remote  references, 
success  or  failure  of  an  instruction,  and  termination),  and  can  use  such  feedback 
to  branch  to  different  SIMD  code  segments.  Obeying  SIMD  instructions,  cells  can 
communicate  with  adjacent  cells  (each  cell  has  16  adjacent  cells  in  its  4  adjacent 
tiles)  to  find  local  patterns  for  rewriting;  hundreds  of  such  patterns  may  be  found  and 
transformed  simultaneously.  Other  SIMD  instructions  allow  communication  among 
nonadjacent  cells  using  special  row  and  column  buses,  relocation  of  data,  and  input- 
output.  The  short  buses,  called  ports,  allowing  fast  communication  of  each  cell  with 
the  16  cells  in  the  north,  south,  east,  and  west  (N,  S,  E,  and  W)  neighboring  tiles,  as 
well  as  the  row  and  column  buses  used  for  communication  of  nonadjacent  cells  under 
SIMD control, are shown in the figure below.

SIMD concurrent rewriting takes place by broadcasting instructions that
implement matching and then replacement of the patterns found. Although for very regular
computations it is possible to avoid remote (i.e., not physically adjacent) references
within a single ensemble, in general the dynamic nature of the computation will
require remote references, and then matching will require relocation of some data.
This is accomplished with specialized instructions and chip-level hardware support,
including cell and tile features, and buses for communication between distant cells.

We use a reference counting scheme for storage management, both within
ensembles and in the RRM as a whole. We have fully simulated the details of this scheme
within the ensemble for the examples discussed later in this report.


3.2  Cluster  and  Network  Architecture 

The cluster architectural level corresponds to board-level structure in the actual
implementation. At this level, ensemble chips can be arranged in a two-dimensional
(2D) mesh with fast connections to each of four neighbors, giving 8 connections per
ensemble (4 in and 4 out). With current technology, these could be 16-bit-wide
connections running at 50 MHz, giving 800 Mbps per connection and 6.4 Gbps total
bandwidth per chip. Additional interconnection hardware at the board level beyond
the fast, local connections is also desirable, as in the iWARP [3] and DataWave [21]
designs. The performance we assume is not far beyond that provided by these
designs; the iWARP has 8 ports, each 8 bits wide at 40 MHz, giving 320 Mbps per
port and 2.56 Gbps total (100 to 150 ns latency), and the DataWave has 8 ports,
each 12 bits wide at 60 MHz, giving 5.76 Gbps total. We estimate that a cluster
will have about 100 ensembles.
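
The bandwidth figures in this paragraph follow from multiplying port width by clock rate and by port count. The Python sketch below reproduces that arithmetic; the function names are ours, and it assumes one transfer of the full port width per clock cycle.

```python
def link_mbps(width_bits, clock_mhz):
    """Bandwidth of one connection, assuming one full-width transfer per cycle."""
    return width_bits * clock_mhz


def chip_gbps(ports, width_bits, clock_mhz):
    """Aggregate bandwidth over all ports of one chip, in Gbps."""
    return ports * link_mbps(width_bits, clock_mhz) / 1000


# (ports, width in bits, clock in MHz) as quoted in the text
DESIGNS = {
    "RRM ensemble": (8, 16, 50),
    "iWARP": (8, 8, 40),
    "DataWave": (8, 12, 60),
}

for name, (ports, width, clock) in DESIGNS.items():
    print(f"{name}: {link_mbps(width, clock)} Mbps per port, "
          f"{chip_gbps(ports, width, clock):.2f} Gbps total")
```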

The network level interconnection for the RRM has not been fixed. We have
been considering the wormhole routing networks of Seitz [22] and Dally [5]. Actual
realizations of these designs have achieved high communication rates: 205 Mbps for
Ametek 2010, and 200 Mbps for the Intel Paragon. For a 2D mesh, average case
communication time for 10,000 nodes is estimated at 1885 ns, or 188 clock cycles.
For a 3D mesh, the average case communication cost for 10,000 nodes is estimated
at 976 ns, or 98 clock cycles.
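
The conversion from nanoseconds to clock cycles can be checked as follows. This sketch assumes that the cycle counts refer to the 100 MHz ensemble clock estimated in Section 3.1 (a 10 ns cycle); the text does not say so explicitly, but the quoted figures are consistent with it.

```python
def latency_in_cycles(latency_ns, clock_mhz=100):
    """Convert a latency in nanoseconds to clock cycles at the given clock rate."""
    return latency_ns * clock_mhz / 1000


mesh_2d = latency_in_cycles(1885)   # 188.5 cycles, quoted as 188 in the text
mesh_3d = latency_in_cycles(976)    # 97.6 cycles, quoted as 98 in the text
```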

In general, interchip communication in the RRM is asynchronous message passing
that imposes no critical timing requirements on the network or switching
technology. Thus, the RRM can exploit the best communication technology available, and
take advantage of any future improvements. At the same time, the RRM can exploit
locality and use fast local interensemble connections at the cluster level to get very high
performance for certain problems.

Figure 3: Before and After Creation of Ghost (of b)

4  Interensemble  Computation 

Sometimes active cells in one ensemble need information from descendants in
another ensemble. We call references from one ensemble to another ensemble distant
references, to distinguish them from the remote references that occur from cell to
nonneighbor cell within a single ensemble. Although distant references can be
reduced by relocating data to ensembles that reference it most often, it is impossible
to completely eliminate distant data references, even using static memory allocation,
because, in general, structures will not fit in a single ensemble. To efficiently support
interensemble communication, we have developed two related mechanisms.

For  symbolic  computation,  where  data  is  laid  out  dynamically  and  computation 
is  asynchronous  or  delay-insensitive,  we  use  an  incremental  symbolic  cache  approach. 
When  a  distant  reference  is  made,  and  it  is  determined  that  the  distant  node  should 
not  be  relocated  to  the  local  ensemble,  then  a  ghost  node  is  instead  allocated  in  a  cell 
of  the  local  ensemble,  and  data  from  the  target  of  the  distant  reference  is  copied  into 
the  ghost  node.  However,  unlike  true  relocation,  the  ghost  node  is  prevented  from 
being  the  root  of  a  rewrite,  i.e.,  is  temporarily  frozen.  Also,  a  ghost  node  maintains  a 
copy  of  the  original  distant  pointer,  and  thus  acts  as  a  passive  incremental  “symbolic 
cache”  of  data  that  actually  resides  on  another  ensemble.  After  some  time,  under 
SIMD  control,  ghost  nodes  flush  their  data  and  use  the  stored  distant  pointer  to 
refresh  their  contents.  This  flush-refresh  of  ghost  information  may  be  performed  at 
any  time.  In  addition,  at  some  times  the  parent  of  a  ghost  may  copy  the  distant 
pointer  from  its  descendant  ghost,  and  then  cause  deletion  of  the  ghost. 
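
The ghost mechanism can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Cell` and `Ensemble` classes, the `(ensemble name, cell id)` pointer encoding, and the function names are hypothetical, and the decision of whether to relocate a node or create a ghost is left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cell:
    token: str
    children: list = field(default_factory=list)  # pointers: (ensemble name, cell id)
    ghost_of: Optional[tuple] = None               # original distant pointer, ghosts only

    @property
    def frozen(self):
        """A ghost may not be the root of a rewrite."""
        return self.ghost_of is not None


class Ensemble:
    def __init__(self, name):
        self.name = name
        self.cells = {}

    def allocate(self, cell):
        cell_id = len(self.cells)  # no deallocation in this sketch
        self.cells[cell_id] = cell
        return (self.name, cell_id)


def deref(machine, pointer):
    name, cell_id = pointer
    return machine[name].cells[cell_id]


def make_ghost(machine, parent_ptr, child_index):
    """Replace a distant child pointer of the parent with a local ghost."""
    parent = deref(machine, parent_ptr)
    target_ptr = parent.children[child_index]
    local = machine[parent_ptr[0]]
    if target_ptr[0] == local.name:
        return target_ptr  # already local, nothing to do
    target = deref(machine, target_ptr)
    # Copy the data; the ghost's children stay distant pointers into the other ensemble.
    ghost = Cell(target.token, list(target.children), ghost_of=target_ptr)
    ghost_ptr = local.allocate(ghost)
    parent.children[child_index] = ghost_ptr
    return ghost_ptr


def refresh(machine, ghost_ptr):
    """Flush a ghost and reload it through its stored distant pointer."""
    ghost = deref(machine, ghost_ptr)
    source = deref(machine, ghost.ghost_of)
    ghost.token, ghost.children = source.token, list(source.children)


A, B = Ensemble("A"), Ensemble("B")
machine = {"A": A, "B": B}
c_ptr = B.allocate(Cell("c"))
d_ptr = B.allocate(Cell("d"))
b_ptr = B.allocate(Cell("b", [c_ptr, d_ptr]))
a_ptr = A.allocate(Cell("a", [b_ptr]))   # a holds a distant pointer to b
ghost_ptr = make_ghost(machine, a_ptr, 0)
```

After the call, cell a points to a local, frozen copy of b, and that copy keeps both b's child pointers and the original distant pointer, so it can be refreshed later.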

For example, in Figure 3, in the before (left) picture, ensemble A contains a cell
labeled a that has a distant pointer to a cell labeled b in ensemble B. In the course
of pattern matching, cell a requires information from its descendant b. In the after
(right) picture, a ghost node for b has been created in ensemble A, and the distant
pointer from a to b has been replaced with a local pointer from a to the new ghost of
b. Thus the ghost of node b has distant pointers to the children of b, and also has a
copy of the original distant pointer (shown as a dashed arrow to node b in ensemble
B). Note that this process cannot continue indefinitely, since ghosts are not allowed
to initiate the matching process themselves. Thus even if the structure underneath
b is large, only that portion of the structure needed to verify a match rooted at a is
ever copied to ensemble A.

Figure 4: Systolic Interensemble Computation

The mechanism used in the systolic case is similar in spirit to the symbolic case
described above, but can be implemented somewhat more efficiently, due to the locality
of  reference  that  (in  part)  characterizes  systolic  computations.  Because  this  locality 
does  not  change  during  a  computation,  we  should  place  elements  that  communicate 
frequently  on  the  same  ensemble.  As  in  the  symbolic  case,  structures  may  be  too 
large  to  fit  on  a  single  ensemble,  and  then  we  must  place  portions  of  the  problem 
on  neighboring  ensembles,  while  keeping  local  copies  of  the  border  data  current  on 
both  ensembles.  Since  systolic  computation  is  synchronous  and  delay-sensitive,  we 
must  ensure  that  the  border  data  is  updated  correctly  when  it  is  read  by  the  local 
ensemble.  In  general  the  systolic  computation  must  wait  every  cycle  for  the  block 
transfer  of  data  between  ensembles. 

In Figure 4, ensembles A and B each contain an area of active cells delineated
by  the  dashed  box.  Outside  this  box  are  border  cells  that  do  not  necessarily  perform 
computations,  but  instead  store  copies  of  the  near-edge  cells  of  neighboring  ensembles. 
Figure  4  shows  a  block  copy  of  information  from  active  cells  in  ensemble  B  to  (passive) 
edge  cells  in  ensemble  A.  After  information  from  each  neighboring  ensemble  is  copied 
into  ensemble  A,  the  next  step  of  computation  can  proceed. 
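
The border-copy scheme can be illustrated with a one-dimensional Python sketch. Everything specific here is an assumption made for illustration: the three-point sum used as the systolic operation, the replication of values at the outer edges, and the list layout `[border, active cells..., border]` for each ensemble.

```python
def step_single(xs):
    """Reference: the whole problem on one ensemble, outer edges replicated."""
    n = len(xs)
    return [xs[max(i - 1, 0)] + xs[i] + xs[min(i + 1, n - 1)] for i in range(n)]


def update_borders(a, b):
    """Block-copy near-edge active cells into the neighbor's border cells."""
    a[0] = a[1]      # outer edge of A: replicate
    a[-1] = b[1]     # B's near-edge active cell -> A's border cell
    b[0] = a[-2]     # A's near-edge active cell -> B's border cell
    b[-1] = b[-2]    # outer edge of B: replicate


def step_local(block):
    """One systolic step on the active cells; border cells are passive."""
    inner = [block[i - 1] + block[i] + block[i + 1]
             for i in range(1, len(block) - 1)]
    return [block[0]] + inner + [block[-1]]


def systolic_step(a, b):
    """Wait for the border transfer, then compute on both ensembles."""
    update_borders(a, b)
    return step_local(a), step_local(b)


def run_split(xs, steps):
    """Split the data across two ensembles and run the given number of steps."""
    half = len(xs) // 2
    a = [0] + xs[:half] + [0]
    b = [0] + xs[half:] + [0]
    for _ in range(steps):
        a, b = systolic_step(a, b)
    return a[1:-1] + b[1:-1]


def run_single(xs, steps):
    for _ in range(steps):
        xs = step_single(xs)
    return xs
```

Because the borders are refreshed before every step, the split computation produces the same values as the single-ensemble reference, at the cost of one border transfer per step.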

In  many  cases  we  can  overlap  communication  with  computation.  This  potential 
overlap,  or  rudimentary  pipelining  of  I/O  and  computation,  is  another  consequence  of 
our  architectural  choice  of  multiple  cells  per  tile.  The  current  design  of  the  ensemble 
with  four  cells  per  tile  allows  simultaneous  systolic  computation  of  four  distinct  two- 
dimensional  layers  at  a  time.  In  fact,  one  or  two  layers  could  perform  I/O  at  the  same 
time  that  the  other  layers  perform  their  systolic  computations.  In  this  way,  we  may 
hide some of the potential I/O penalty of interensemble computations.

4.1  Load  Balancing 

Allocation  in  an  ensemble  normally  ensures  that  allocated  cells  are  neighbors  of  the 
allocating  cell.  However,  when  an  ensemble  becomes  too  full,  allocations  are  made 
on  other  ensembles.  This  process  can  be  described  as  pushing  out  computational 
subtasks.  The  SIMD  controller  can  gather  (perhaps  imprecise)  information  about 
the  utilization  level  of  an  ensemble  in  order  to  determine  when  the  ensemble  is  full. 
For  certain  computations,  it  may  be  advisable  to  push  out  subtasks  at  the  outset. 
Large  symbolic  computations  usually  require  building  and  manipulating  very  large 
term  structures,  which  may  be  distributed  over  several  ensembles  when  they  are 
initialized,  may  be  distributed  explicitly  by  a  specially  tuned  SIMD  broadcast,  or 
may  migrate  implicitly  to  neighboring  ensembles  during  computation. 
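
One possible push-out policy is sketched below in Python. The 90% fullness threshold, the choice of the least-used neighbor, and all names are hypothetical; the report states only that allocations move to other ensembles when an ensemble becomes too full. The capacity of 576 cells is the per-ensemble estimate from Section 3.

```python
CAPACITY = 576  # cells per ensemble (estimate from Section 3)


def place(used, neighbours, home, n_cells, threshold=0.9):
    """Pick the ensemble that receives an allocation of n_cells.

    used: dict mapping ensemble name -> cells currently allocated.
    neighbours: dict mapping ensemble name -> list of neighboring ensembles.
    Allocation stays on the home ensemble until it is "too full", then the
    subtask is pushed out to the least-used neighbor that has room.
    """
    if used[home] + n_cells <= threshold * CAPACITY:
        target = home
    else:
        candidates = [e for e in neighbours[home]
                      if used[e] + n_cells <= CAPACITY]
        if candidates:
            target = min(candidates, key=lambda e: used[e])
        elif used[home] + n_cells <= CAPACITY:
            target = home  # no neighbor has room; use the remaining local space
        else:
            raise MemoryError("no ensemble in the neighborhood has room")
    used[target] += n_cells
    return target


used = {"A": 500, "B": 100, "C": 300}
neighbours = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
first_target = place(used, neighbours, "A", 10)    # still fits locally
second_target = place(used, neighbours, "A", 20)   # pushed out to a neighbor
```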

Allocation is important in architectures like the RRM, due to the sensitivity of
computation to locality. Thus, initial placement may have a large impact on
performance, especially for relatively short computations with large amounts of data.
After initial allocation, the compiled SIMD code may explicitly push
subcomputations out of an ensemble, perhaps forming a ghost node in its place. Thus the local
copy does not perform rewrites itself, although it would still participate passively in
other rewrites.

Finally, automatic migration can be performed by pushing subtasks out of an
ensemble based on the depth of the subterm from a root node of the ensemble, forcing
subcomputations to be pushed out more quickly. However, spreading computation
more quickly, and thus more evenly among ensembles, trades off against
interensemble communication overhead. The techniques for interensemble computation
described above substantially alleviate this overhead, but it still exists.


5  Simulation  and  Performance  Estimation 

Estimating  the  performance  of  computer  systems  is  a  difficult  art  at  best,  and  is 
even  more  difficult  for  radically  new  machines  that  have  not  yet  been  built.  The 
performance  limitations  of  simulators  mean  that  large  problems  are  very  difficult  to 
run.  Testing  different  aspects  of  a  design  on  the  largest  possible  problems  may  force 
using multiple simulators to abstract different details for various choices of
performance measure and problem. But then it may be difficult to justify the abstractions,
and  to  ensure  that  the  problems  fit  the  assumptions  behind  their  justifications.  For 
the  RRM,  these  difficulties  seem  particularly  acute,  because  of  the  high  performance 
figures  that  we  seek  to  justify. 

Given the serious performance limitations posed by trying to simulate our
architecture on the workstations available at the time our research was carried out (limitations
that still apply today to a good extent), the approach to performance estimation that
we adopted was a hybrid one:

•  Interensemble simulations, in order to be computationally feasible, were
performed at a high level of modeling using a high-level interensemble simulator
(see Section 5.1). This simulator is not detailed enough to yield precise quantitative
information, but it is useful for obtaining preliminary estimates of communication
requirements (see Section 5.2), for gaining experience with interensemble
computations, and for getting some rough performance estimates.

•  Ensemble simulations for our new RRM ensemble design were performed using
a very detailed register-transfer level ensemble simulator (see Section 5.3). By
running a widely varied collection of applications on this simulator, very precise
estimates were obtained at the ensemble level. Using modeling for the higher
levels of the RRM architecture, based on architectural assumptions consistent
with the interensemble simulation experiments and feasible with current
technology, we were then able to obtain more detailed interensemble performance
estimates for the RRM as a whole.

5.1  Interensemble  Simulations 

We  developed  an  interensemble  simulator  for  the  RRM  and  used  it  to  develop  and 
test  ideas  about  interensemble  communication  mechanisms  and  strategies;  the  code 
of  this  simulator  as  well  as  that  of  the  benchmarks  run  on  it  is  given  in  Appendix 
A.  For  this  simulator,  the  RRM  as  a  whole  was  modeled  as  a  2D  array  of  clusters 
which, themselves, were 2D arrays of cells. Clusters were assumed to be interconnected
in such a way that one could view the machine as a whole as a 2D array of
ensembles. We introduced the notion of clusters to model a difference in technology
between board-level interconnection and interconnection at a larger scale. The
overall  2D  topology  was  chosen  because  it  was  a  relatively  modest  interconnection 
structure  that  should  be  realizable  in  practice.  The  simulator  was  instrumented  to 
keep  track  of  communication  at  different  levels  so  that  we  could  get  some  estimates 
of  the  communication  requirements  of  the  RRM  as  a  whole  and  between  clusters. 

The simulator was a very high level simulator, manipulating term structures by
applying rewrite rules, but each term was considered to be located in a particular
ensemble and ensembles had some limitations on the total size of the terms they could
contain. The ensembles apply rewrite rules independently, and then apply strategies
to determine if terms should be pushed out, making more room, or pulled in, making
ghosts. The basic high-level actions are those of pushing out a subcomputation
(somewhat similar to a remote procedure call) and copying results or partial results
back in (which returns a final result or provides a cached version of a partial result).

Two primary strategies were used for deciding when to push subterms out of an
ensemble.  One  strategy  was  based  on  a  threshold  value  for  the  depth  of  a  term  below 
a  root  in  the  ensemble.  If  this  depth  threshold  was  chosen  to  be  fairly  shallow,  then 
the  term  would  be  forced  to  rapidly  spread  out  through  the  RRM.  This  would  have 
the  benefit  that  a  given  ensemble  would  be  unlikely  to  contain  only  an  upper  part 
of  the  tree  that  was  waiting  on  subcomputations  to  complete  and  so  be  idle.  This 
strategy  would  be  applied  regardless  of  how  full  the  ensemble  was;  the  size  limitation 
would  mainly  come  about  when  an  ensemble  refused  to  accept  a  subcomputation 
being  pushed  out  from  elsewhere.  Such  a  strategy  is  analogous  to  the  treatment  of 
allocation  within  an  ensemble. 

The other strategy used a fullness threshold to decide when to push out a
subcomputation: if an ensemble becomes very full, a major subcomputation is pushed
out.  The  strategy  for  selecting  the  subterm  to  be  pushed  out  is  to  select  either  whole 
rooted  terms  if  they  are  small  or  to  select  a  top-most  large  subterm  of  a  very  large 
rooted  term.  (In  practice,  it  is  necessary  to  have  a  second  fullness  threshold  that 
determines  when  the  ensemble  will  accept  pushed  out  subcomputations  from  other 
ensembles.) 
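
The two push-out strategies and the acceptance threshold described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the simulator of Appendix A: the `Ensemble` and `Strategy` types, their field names, and the use of percent occupancy as the unit for the thresholds are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of an ensemble's state (names are illustrative). */
typedef struct {
    int capacity;   /* maximum number of cells the ensemble can hold */
    int occupied;   /* cells currently in use */
} Ensemble;

/* Hypothetical tuning parameters for the strategies in the text. */
typedef struct {
    int depth_threshold;   /* push out subterms deeper than this below a root */
    int pushout_fullness;  /* percent occupancy that triggers a push-out */
    int accept_fullness;   /* percent occupancy above which pushes are refused */
} Strategy;

static int percent_full(const Ensemble *e) {
    assert(e->capacity > 0);
    return (100 * e->occupied) / e->capacity;
}

/* Depth strategy: applied regardless of how full the ensemble is. */
bool push_by_depth(const Strategy *s, int depth_below_root) {
    return depth_below_root > s->depth_threshold;
}

/* Fullness strategy: push a major subcomputation out once very full. */
bool push_by_fullness(const Strategy *s, const Ensemble *e) {
    return percent_full(e) >= s->pushout_fullness;
}

/* Second fullness threshold: would this ensemble accept a pushed-out
   subcomputation of the given size from another ensemble? */
bool accepts_push(const Strategy *s, const Ensemble *e, int incoming_cells) {
    Ensemble after = *e;
    after.occupied += incoming_cells;
    return after.occupied <= after.capacity &&
           percent_full(&after) <= s->accept_fullness;
}
```

Keeping the acceptance test separate from the push-out test mirrors the remark that a second fullness threshold is needed in practice; with a single threshold, two nearly full ensembles could push the same subcomputation back and forth.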

Some of the problems simulated were: numeric Fibonacci number calculation,
Peano  arithmetic  Fibonacci,  bubble  sort,  merge  sort,  and  matrix  multiply.  These 
examples  were  chosen  because  the  patterns  of  computation  are  quite  different  in 
these  examples  and  seem  to  represent  an  interesting  variety.  The  amount  of  time 
required  to  do  the  simulations  was  a  limiting  factor  on  the  size  of  problems  run  for 
these  examples. 

Different versions of the simulator were used in simulations. Changes were introduced
in order to make the simulations more realistic; for example, limits on the
amount  of  communication  between  ensembles  in  a  cluster  or  between  clusters  were 
imposed  in  some  versions,  to  test  the  impact  of  such  limits. 

Some  of  the  parameters  that  we  experimented  with  were:  time  penalties  for  both 
intra-  and  intercluster  communication,  size  of  ensemble  for  allocation,  intra-  and 
intercluster  communication  limits,  relocation  size  threshold  (an  ensemble  must  be 
below  a  given  limit  of  occupancy  before  it  can  relocate  structures  inward),  push-out 
threshold  (and  an  associated  goal  for  the  size  of  the  ensemble  when  subcomputations 
have  been  pushed  out),  and  a  value  controlling  when  subcomputations  are  pushed  out 
based  on  the  depth  of  a  term  below  a  root  in  an  ensemble.  Most  of  these  parameters 
were  not  critical  in  that  small  variations  in  their  values  had  only  small  effects  on  the 
performance  of  the  simulated  RRM.  The  robustness  of  the  performance  relative  to 
these  variations  is  a  very  positive  result.  For  the  simulations  performed  to  generate 
estimates  (discussed  later),  values  were  chosen  that  were  as  realistic  as  possible  and 
on  the  conservative  side  (i.e.,  values  that  would  tend  to  produce  the  least  favorable 
result). 

The high level of abstraction of the simulation process means that the results
cannot be expected to be precise predictions of the behavior of the RRM. On the
other  hand,  we  expect  that  the  large-scale  behavior  should  be  roughly  similar.  We 
also  now  believe  that  the  development  of  multiprocessor  interconnection  technology  is 
proceeding  so  rapidly  that  we  can  expect  to  have  networks  that  are  very  well  adapted 
to  the  RRM  architecture. 

5.2  Communication  Requirements  and  Networks 

The  following  subsections  give  an  indication  of  the  current  state  of  the  art  in  high- 
performance, low-latency interconnection networks and then present the specific
estimates of the communication requirements derived from the interensemble simulations.

5.2.1  Prospects  for  High-Performance  Interconnects 

Demand for very high performance interconnects is being driven both by tightly
coupled shared memory systems and by more experimental distributed memory systems.
Theoretical  results  have  been  abundant  in  this  area,  from  theoretical  studies  done  for 
phone  systems  to  a  large  literature  on  hypercubes  and  their  variants.  However,  careful 
comparisons  and  evaluations  of  tradeoffs  and  actual  practical  engineering  experience 
are  just  being  developed  for  the  kind  of  large-scale  interconnects  that  interest  us.  Just 
constructing  an  interconnect  for  500  processors  is  a  major  project,  probably  too  large 
for  the  academic  context  and  hard  to  justify  either  commercially  or  in  government 
funded  research  without  a  clear  use  (i.e.,  an  overall  system  architecture  using  the 
interconnect). 

The  analysis  of  design  tradeoffs  in  the  thesis  of  Dally  has  led  to  a  whole  new 
generation  of  wormhole  routing  interconnects  used  in  machines  such  as  the  Ametek 
2010,  the  Fujitsu  AP1000,  and  the  Intel  iWARP  [3].  The  iWARP  processor  is  intended 
for  very  high  performance,  very  fine  grain  systolic  computation.  A  new  processor  with 
a  similar  goal  is  the  ITT  DataWave  [21].  There  are  also  next  generation  designs  such 
as Seitz’s Caltech MOSAIC project and Dally’s J-machine project at MIT. These
designs achieve impressive communication rates: 205 Mbps for Ametek 2010, 200
Mbps for Fujitsu AP1000, 160 Mbps per port and 2.56 Gbps aggregate for the iWARP
(8 ports, each 8 bits, at 40 MHz and with 100 to 150 ns latency), and 6 Gbps for the
DataWave (8 ports, each 12 bits, at 60 MHz). Both the iWARP and DataWave
examples  are  interesting  because  systolic  computation  is  even  more  demanding  of 
the  interconnect  than  the  RRM  design  is  expected  to  be.  Progress  can  be  expected 
to  continue  along  these  lines,  and  good  designs  will  be  developed  and  tested  for 
systems  with  large  numbers  of  processors  (500  to  1000  or  more).  For  example,  the 
MOSAIC system is planned to have 16000 processors and will be based on a 3D mesh
(32 x 32 x 16).

Another feature of the iWARP and DataWave designs is that they are single-chip
processor designs. This allows the 8x8 prototype for iWARP to fit on a single board.
With current technology, this gives a significant advantage, as wiring can be
much denser on a board than off. The other machines (Ametek 2010, Fujitsu AP1000)
were  based  on  processing  nodes  with  from  one  to  four  processors  per  board.  The  goal 
for  the  RRM  is  to  have  the  complete  ensemble  consist  of  an  ensemble  chip  (each 
with  576  processing  elements  (PE’s))  perhaps  with  one  or  two  more  chips  for  extra 
memory  and  routing,  thus  also  allowing  many  processors  per  board.  We  believe  that 
low  latency  and  local  high  performance  may  be  much  more  important  than  global 
bandwidth. 

In  the  future,  it  is  likely  that  there  will  be  very  important  solutions  based  on 
mixtures of technologies. For example, CMOS to GaAs with silicon lasers and an
optical interconnect could provide multi-gigahertz rates over a single optical fiber. To
reduce  the  number  of  wires  and  maintain  bandwidth,  it  is  necessary  to  increase  the 
frequency  or  rate  of  operation,  which  suggests  a  change  in  basic  technology.  However, 
it is likely that CMOS or related technologies will continue to provide the highest
density  available,  which  suggests  that  a  mixed  approach  might  be  desirable.  The 
important  point  here  is  that  there  is  much  room  for  improvement  of  interconnection 
technology  and  that  the  RRM  — because  of  its  asynchronous  model  of  computation — 
can  easily  take  advantage  of  such  improvements. 

5.2.2  Communication  Requirements 

We have used the high-level interensemble simulator on a suite of characteristic
examples to estimate upper bounds on the communication demands of an ensemble.
Our  upper  bounds  seem  to  be  in  the  range  of  the  newer  network  and  interconnection 
technologies  such  as  the  wormhole  routing  networks  of  Seitz  [22]  and  Dally  [5]. 

As  mentioned,  interchip  communication  in  the  RRM  is  asynchronous  message 
passing  communication  and  imposes  no  critical  timing  requirements  on  the  network 
or switch technology. This makes the RRM capable of exploiting the best
communications technology available, and of taking advantage of any future improvements
in  such  technology.  The  following  are  important  observations  on  the  communication 
requirements  of  an  ensemble: 

• Estimated ensemble I/O rate: 160 Mbps to 520 Mbps (estimate based on
specially instrumented interensemble simulations).

• Pins are not a bottleneck: realistic current estimate is 4 Gbps (100 pins at 40
MHz).

• Communication capacity seems to be in the range of newer network and
interconnection designs (Seitz [22], Dally [5], iWARP [3], DataWave [21]).

• Average communication performance is enough; the RRM design does not impose
critical timing requirements on the communication network.
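
The bandwidth figures in this section follow from multiplying the number of parallel signal lines by the clock rate. The sketch below reproduces that arithmetic; the function names are hypothetical, and it assumes one bit per line per clock cycle, which ignores any protocol or handshaking overhead.

```c
#include <assert.h>

/* Raw bandwidth in Mbps of a parallel interface: lines x clock (MHz),
   assuming one bit per line per clock cycle (an assumption of this sketch). */
long raw_mbps(long lines, long clock_mhz) {
    assert(lines > 0 && clock_mhz > 0);
    return lines * clock_mhz;
}

/* Does the pin budget cover an estimated ensemble I/O demand in Mbps? */
int pins_sufficient(long pins, long clock_mhz, long demand_mbps) {
    return raw_mbps(pins, clock_mhz) >= demand_mbps;
}
```

For example, `raw_mbps(100, 40)` gives 4000 Mbps (the 4 Gbps pin estimate), `raw_mbps(8 * 8, 40)` gives 2560 Mbps (the iWARP aggregate), and `raw_mbps(8 * 12, 60)` gives 5760 Mbps (roughly the 6 Gbps quoted for the DataWave). Against the upper I/O estimate of 520 Mbps, the pin budget leaves a margin of more than seven times.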

5.3  Ensemble  Simulations  and  Interensemble  Performance 
Modeling 

The  new  ensemble  design  and  estimates  of  its  performance  have  been  validated  by 
running  a  variety  of  benchmarks  on  a  new  ensemble  simulator  written  in  C,  which 
models the ensemble computation in great detail at the register transfer level. A
detailed description of this simulator is given in Appendix B. The code of the simulator
as  well  as  that  of  several  applications  run  on  it  are  given  in  Appendix  C. 

Our simulations at the ensemble level have a great level of detail and give quite
accurate performance estimates, but our overall performance estimates for the RRM are
still preliminary, and more studies and experiments are required to increase their
accuracy. The present estimates are based on detailed ensemble simulations, high-level
interensemble  simulations,  estimates  of  communication  requirements,  and  analysis 
using  simple  approximate  models.  More  definitive  performance  estimates  will  require 
more  detailed  simulations  and  analytic  studies  for  a  wider  collection  of  examples  and 
applications. 

The  performance  models  are  based  on  simple  predictions  of  the  computation  times 
for  specific  strategies  for  performing  the  computations.  We  discuss  RRM  performance 
predictions  for  a  variety  of  examples  chosen  because  their  patterns  of  computation 
are  representative  of  different  kinds  of  computations;  they  represent  basic  examples 
of  general  symbolic  computations  (numeric  Fibonacci  and  the  TAK  function),  highly 
regular  symbolic  computations  (sorting),  and  discrete  event  simulations  of  a  systolic 
nature  (fluid  flow  and  a  simple  hardware  simulator). 

When  describing  RRM  performance  at  the  cluster  or  network  levels,  we  specify 
efficiency  as  a  percentage  of  the  ideal  performance.  The  ideal  performance  corresponds 
to  a  linear  extrapolation  of  a  single  ensemble’s  performance,  i.e.,  a  linear  speedup. 
We  will  also  give  “idealized  Sun-relative  speedup,”  which  simply  is  the  product  of  the 
number  of  ensembles,  the  Sun-relative  speedup,  and  the  efficiency. 
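
The definition of idealized Sun-relative speedup is a plain product, so it can be checked directly against the figures reported later in this section. The function below is a minimal sketch; its name and the convention of passing efficiency as a fraction between 0 and 1 are assumptions made here.

```c
#include <assert.h>

/* Idealized Sun-relative speedup = ensembles x single-ensemble speedup x
   efficiency, where efficiency is a fraction of ideal (linear) speedup. */
double idealized_sun_relative_speedup(long ensembles,
                                      double sun_relative_speedup,
                                      double efficiency) {
    assert(ensembles > 0);
    assert(efficiency > 0.0 && efficiency <= 1.0);
    return (double)ensembles * sun_relative_speedup * efficiency;
}
```

With 10,000 ensembles, a Sun-relative speedup of 6.7, and 88% efficiency, this gives about 58,960, which rounds to the 59,000 reported for numeric Fibonacci. With a 100-ensemble cluster, a speedup of 533, and 75% efficiency, it gives 39,975, the lower end of the range reported for the hardware simulator.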

We  assume  a  100-MHz  clock  and  a  12  x  12  array  of  tiles  requiring  approximately 
6  million  transistors.  These  figures  seem  achievable  since  speeds  and  sizes  of  this  kind 
have  already  been  demonstrated.  For  example,  the  1991  Hot  Chips  conference  [15] 
presented  two  chips  with  100-MHz  clocks  (one  of  them  with  4.1  million  transistors), 
and  another  chip  with  14  million  transistors. 

There are many different performance measures for machines, including machine
instruction execution rates, and actual elapsed time. The most intrinsic ensemble
performance estimate is the number of clock cycles needed for a given computation.
By assuming a specific clock rate, this measure can be translated into seconds.
However, some relative comparison of performance between the ensemble and existing
sequential processors is also desirable. We use the Sun-relative speedup for this
purpose. To obtain this comparative measure we write one program in ensemble SIMD
code or with rewrite rules, and another in efficient C. By comparing the actual
performance of the C program on a Sun workstation with the performance of the SIMD
code on the ensemble simulator, we obtain for each problem a speedup measure called
“Sun-relative speedup.” In our case, we take a Sun SPARCstation IPC as the basis
for comparison. This could also be used to assign a “MIPS” rating to the ensemble
by multiplying this speedup by the published MIPS rating of the specific Sun
workstation, which is roughly 15 MIPS for the SPARCstation IPC. In most cases, the aim
is  to  compare  a  good  algorithm  for  a  problem  on  the  RRM  with  a  good  sequential 
algorithm  on  a  Sun.  In  some  cases,  the  optimized  sequential  Sun  version  involves 
significant  variations  from  the  algorithm  used  on  the  RRM.  When  we  discuss  each 
benchmark  below,  at  the  ensemble  level  and  levels  above,  we  mention  the  specific 
assumptions  made. 

5.3.1  Performance  Estimates  for  TAK 

The TAK benchmark is a subtle modification of the function Ikuo Takeuchi originated
specifically to test Lisp systems. The modification accidentally introduced by
Richard  Gabriel  and  John  McCarthy  makes  the  function  more  difficult  to  optimize, 
but  preserves  its  simple,  recursion-intensive  structure.  We  have  implemented  TAK 
for  the  RRM  and  in  C  for  purposes  of  comparison.  The  Lisp  and  C  code  are  shown 
below: 


    (defun tak (x y z)
      (if (not (< y x))
          z
          (tak (tak (1- x) y z)
               (tak (1- y) z x)
               (tak (1- z) x y))))

    tak(x,y,z) register int x,y,z;
    { int r1, r2, r3;
      while (1) {
        if (x<=y) return z;
        r1 = tak(x-1,y,z);
        r2 = tak(y-1,z,x);
        r3 = tak(z-1,x,y);
        x = r1; y = r2; z = r3; }}

Because  our  most  detailed  simulations  are  limited  to  a  single  ensemble,  we  have 
used  the  arguments  12,8,4  instead  of  the  more  traditional  18,12,6.  The  RRM  code 
completes this benchmark in 22,428 cycles (.00022428 second at the assumed 100-MHz
clock), while the C version finishes in .0015 second on a SPARCstation IPC. This
leads to a Sun-relative speedup of 6.7 (= .0015/.00022428). We currently don’t have
cluster or RRM estimates for this example.
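
The conversion from a simulated cycle count to a Sun-relative speedup, and from there to a rough “MIPS” rating, can be written out as below. This is a sketch of the arithmetic only; the function names are hypothetical, and the 100-MHz clock and 15-MIPS rating are the values assumed elsewhere in this section.

```c
#include <assert.h>

/* Convert a simulated cycle count to seconds at a given clock rate in Hz. */
double cycles_to_seconds(long cycles, double clock_hz) {
    assert(clock_hz > 0.0);
    return (double)cycles / clock_hz;
}

/* Sun-relative speedup: measured Sun time divided by simulated ensemble time. */
double sun_relative_speedup(double sun_seconds, long cycles, double clock_hz) {
    return sun_seconds / cycles_to_seconds(cycles, clock_hz);
}

/* Rough "MIPS" rating: the speedup times the Sun workstation's MIPS rating. */
double mips_rating(double speedup, double sun_mips) {
    return speedup * sun_mips;
}
```

For TAK, `sun_relative_speedup(0.0015, 22428, 100e6)` is approximately 6.69, and multiplying by 15 MIPS gives a rating of roughly 100 MIPS for a single ensemble on this benchmark.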

5.3.2  Performance  Estimates  for  Numeric  Fibonacci 

A strategy for computing numeric Fibonacci, which yields a simple approximate
model for estimating performance, is to do the computation directly if it fits in one
ensemble, and otherwise apply the last of the following rewrite rules for fibo

    fibo(0) = 0
    fibo(1) = 1
    fibo(N) = fibo(N-1) + fibo(N-2) if N>1

once, and then push out the subcomputation of fibo(N-2), to proceed in parallel
with that of fibo(N-1), which either may be done locally or may push out further
subcomputations.  This  strategy  always  keeps  a  significant  subcomputation  for  the 
current  ensemble.  Detailed  ensemble  simulations  allow  quite  accurate  estimates  of 
time  required  for  n  up  to  10  (it  is  linear  in  n).  By  comparing  with  the  time  required 
to  run  the  same  algorithm  in  C  on  a  Sun  workstation,  we  obtain  a  Sun-relative 
speedup  of  6.7.  The  cost  for  larger  n  is  the  time  to  set  up  the  subcomputations,  plus 
the  maximum  of  the  cost  to  finish  the  local  subcomputation  and  the  cost  to  finish 
the  pushed  out  subcomputation,  plus  the  cost  to  finish  the  computation.  Assuming 
that network I/O can be overlapped with SIMD broadcast, but that transferring a
simple expression like fibo(10) out of an ensemble or transferring a result such as
2584 takes just a small number of SIMD instructions, the complete time to compute
the numeric Fibonacci can be modeled by a recursive function allowing different
assumptions about the network communication delays. For very fast networks, the
network  communication  times  and  the  computation  times  (for  setup  and  finishing)  are 
roughly  comparable,  so  that  network  I/O  cannot  dominate  the  overall  computation 
time  (usually  it  will  be  overlapped  with  computation). 

The cost of numeric Fibonacci within an ensemble is approximated by

    fibens(n) = 250 x n - 50

for n > 3. The approximate cost to compute the n-th Fibonacci, for n > 10, is then

    fibgen(n) = simdcost + max(fibgen(n - 1), fibgen(n - 2) + pushcost)

where simdcost is the SIMD execution cost to set up the subcomputations, push
out, pull in, and finish the Fibonacci computation (approximately 300 clock cycles),
pushcost is the cost to do two I/O operations (estimated to be less than 200 clock
cycles for 10,000 ensembles), and fibgen(n) = fibens(n) for n ≤ 10. With these
estimates the simdcost dominates, I/O is overlapped, and efficiency is very good. For
larger n, fibgen(n) = 300 x n - 455. For a 10,000-ensemble RRM, the predicted
worst-case  efficiency  for  this  example  is  88%,  which  seems  quite  encouraging.  The 
idealized  Sun-relative  speedup  is  59,000. 
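
The recursive cost model for numeric Fibonacci can be evaluated with a few lines of code. The sketch below is an illustration under stated assumptions: it treats n = 10 as the largest problem computed inside a single ensemble, evaluates the recursion bottom-up, and takes simdcost and pushcost as parameters so that different network assumptions can be tried.

```c
#include <assert.h>

/* Cost in clock cycles inside one ensemble; the text gives this
   approximation for n > 3. */
long fibens(int n) {
    assert(n > 3);
    return 250L * n - 50;
}

/* Cost across ensembles, evaluated bottom-up from the in-ensemble cases.
   Taking n <= 10 as the in-ensemble range is an assumption of this sketch. */
long fibgen(int n, long simdcost, long pushcost) {
    if (n <= 10) return fibens(n);
    long prev2 = fibens(9), prev1 = fibens(10);
    for (int k = 11; k <= n; k++) {
        long remote = prev2 + pushcost;   /* pushed-out fibo(k-2) plus I/O */
        long cur = simdcost + (prev1 > remote ? prev1 : remote);
        prev2 = prev1;
        prev1 = cur;
    }
    return prev1;
}
```

With the rounded constants simdcost = 300 and pushcost = 200, the local branch always dominates the maximum, so the cost grows by 300 cycles per step, in agreement with the slope of the linear estimate quoted for larger n. The constant term produced by these rounded values (300 x n - 550) differs somewhat from the quoted one, which may reflect unrounded constants in the original analysis.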

5.3.3  Performance  Estimates  for  Sorting 

A  simple  way  to  sort  a  sequence  of  numbers  on  an  RRM  ensemble  is  to  use  a  2D 
exchange  sort  that  uses  both  “bubblesort”  exchanges  of  consecutive  elements  of  the 
sequence and “shortcut” exchanges between nonconsecutive elements. By appropriate
placement of the sequence within an ensemble, both types of exchanges can be
accomplished  by  simple,  local  transformations.  For  a  23  x  23  array  of  values  we  can 
form  a  linear  sequence  of  numbers  in  the  array  by  going  down  the  first  tile  column, 
up  the  second  column,  and  so  forth.  We  can  also  establish  horizontal  shortcut  links 
between  list  elements  that  are  adjacent  elements  of  the  same  row.  By  folding  the 
2D  array  twice,  it  is  possible  to  embed  the  array  in  an  ensemble  and  fit  a  list  with 
23  x  23  (=  529)  elements  inside  an  ensemble  in  such  a  way  that  all  links  are  direct 
neighbor-to-neighbor  connections.  The  2D  exchange  sort  algorithm  alternates  bubble 
sort  exchanges  between  consecutive  elements  in  the  sequence  with  shortcut  exchanges 
between  nonconsecutive,  but  horizontally  adjacent  elements.  For  a  list  of  length  n 
placed in this manner, the time to do a 2D sort within a single ensemble is
proportional to √n, and requires approximately 221 x √n - 468 clock cycles. The average
number of instructions for either the bubblesort or the shortcut exchange phases is
42, giving a main loop size of 84. Comparing with the time taken by a simple
quicksort algorithm written in C and running on a Sun workstation yields a Sun-relative
speedup  of  127.  Uniformly  distributed  random  data  was  used  for  the  tests. 
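
The placement of the list in the array can be made concrete with a small index calculation. The sketch below is illustrative only: it numbers rows and columns from zero, assumes the chain runs down even-numbered columns and up odd-numbered ones, and does not model the folding of the array into the ensemble. It also evaluates the quoted cycle estimate for a square array, taking the side length (the square root of n) as input.

```c
#include <assert.h>

/* Position in the linear chain of the element at (row, col) of a
   side x side array: down even-numbered columns, up odd-numbered ones. */
int snake_index(int row, int col, int side) {
    assert(row >= 0 && row < side && col >= 0 && col < side);
    return (col % 2 == 0) ? col * side + row
                          : col * side + (side - 1 - row);
}

/* Distance along the chain covered by the horizontal shortcut link
   from (row, col) to (row, col + 1). */
int shortcut_span(int row, int col, int side) {
    assert(col + 1 < side);
    int d = snake_index(row, col + 1, side) - snake_index(row, col, side);
    return d < 0 ? -d : d;
}

/* Quoted single-ensemble estimate, 221 x sqrt(n) - 468 clock cycles,
   for a list of n = side x side elements. */
long sort_cycles(int side) {
    return 221L * side - 468;
}
```

For the 23 x 23 case, a shortcut at the top of the first column spans 45 chain positions while one at the bottom spans only 1, which suggests why alternating shortcut exchanges with bubblesort exchanges moves values much faster than bubblesort alone. `sort_cycles(23)` evaluates the quoted estimate to 4,615 clock cycles for a full list of 529 elements.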

At  the  interensemble  level,  one  can  use  the  same  pattern,  i.e.,  ensembles  in  a  mesh 
and  interchanges  in  the  long  chain  or  rows,  but  interchanges  will  always  exchange 
the maximal value from one ensemble with the minimal value from the next. For
this  problem,  the  computation  within  an  ensemble  has  a  structure  different  from 
the  structure  at  the  cluster  level  and  higher.  The  simple,  fixed  connectivity  is  one 
advantage  of  this  approach;  it  should  be  possible  to  allocate  ensembles  so  that  all  I/O 
connections  are  local,  best-case  links.  The  data  would  be  broken  into  chunks,  which 
start  interchanging  data  internally  and  across  ensemble  boundaries,  in  one  of  two 
directions,  at  their  endpoints.  When  data  items  are  exchanged  across  a  boundary, 
an  item  is  pushed  out  and  another  is  pulled  in,  preserving  the  size  of  the  chunk  in 
the  ensemble.  It  seems  better  not  to  have  a  lock-step  process,  in  which  data  items 
are  always  exchanged,  but  instead  to  exchange  data  only  when  there  is  a  need,  e.g., 
when  a  new  value  has  been  interchanged  into  an  end  position. 

In  order  to  get  estimates  at  the  cluster  level,  a  special  simulator  was  written 
in  C  that  simulates  the  two-level  2D  sorting  algorithm  and  calculates  clock-count 
estimates.  Note  that,  because  of  reduction  of  the  bandwidth  through  a  cross  section 
of  the  machine,  one  expects  that  sorting  at  the  cluster  level  should  be  at  least  23 
times  slower  than  within  an  ensemble.  Since  there  are  100  ensembles  at  the  cluster 
level,  one  might  still  see  some  further  speedup;  however,  the  algorithms  are  more 
complex  and  less  efficient.  If  very  fast  neighbor-to-neighbor  connections  can  be  used 
at  the  cluster  level  (and  this  should  also  be  possible  at  the  level  of  the  RRM  as  a 
whole),  then  exchanging  data  with  a  neighbor  should  take  only  5  to  10  clock  cycles. 
The  phases  consist  of  local  plus  global  linear  exchanges,  with  an  additional  smallest 
to  largest  shortcut,  and  local  plus  global  row  exchanges.  The  additional  cost,  due 
primarily  to  communication,  of  global  operations  is  estimated  at  20  to  30  instructions. 
The  estimated  time  to  sort  in  a  cluster  was  compared  against  the  time  to  quicksort  on 
a  Sun  giving  an  estimated  Sun-relative  speedup,  for  a  100-ensemble  cluster,  of  114. 

For  a  wormhole  routing  network,  when  data  items  are  exchanged  between  two 
ensembles,  a  round-trip  message  is  required  with  estimated  time,  assuming  a  single 
hop  is  required,  of  perhaps  20  clock  cycles.  The  estimated  idealized  Sun-relative 
speedup  for  the  RRM  would  be  very  close  to  the  cluster  case.  It  is  very  possible  that 
the  network  latency  could  be  overlapped  with  other  computation,  and  the  increase  in 
the  total  computation  time  compared  with  the  cluster  case  should  not  be  more  than 
20%. 

5.3.4  Performance  Estimates  for  Fluid  Flow 

Fluid  dynamics  can  be  studied  using  a  2D  cellular  automaton  model  [13].  This 
computational  model  is  nearly  ideal  for  the  RRM,  due  to  its  very  regular  structure 
heavily  using  instructions  that  efficiently  interchange  bits  among  neighboring  cells. 
The  same  communication  pattern  could  be  used  for  many  other  2D  processing  and 
cellular  automata  problems.  In  fact,  we  have  implemented  Conway’s  game  of  Life 
using these same techniques, and have achieved similar performance. Many other
problems,  such  as  certain  vision  algorithms,  stress  analysis  and  particle  diffusion  in 
solids,  fit  this  pattern  of  computation. 

We  have  implemented  a  version  of  the  cellular  automata  approach  based  on  a 
regular 2D hexagonal lattice. Each cell is connected to its six neighbors by links that
may  hold  at  most  one  particle  traveling  in  each  direction  in  each  time  step.  We  use 
unit  time  steps,  unit  particle  masses,  and  unit  velocity.  Each  particle  is  completely 
described  by  the  link  on  which  it  currently  resides,  and  all  particles  have  constant 
kinetic  energy  and  zero  potential  energy.  At  each  time  step,  particles  move  along 
their  links,  possibly  interact  with  other  particles  at  the  center  of  a  hexagonal  cell, 
and  move  to  some  other  link. 

We have implemented this model using one RRM cell to simulate each hexagonal
cell  of  the  model.  Each  RRM  cell  contains  six  bits  that  encode  the  presence  or  absence 
of  outgoing  particles  on  the  links  to  its  six  neighbors.  Communication  is  handled  by 
transferring  the  six  bits  from  each  cell  to  the  appropriate  neighbor.  Computation  is 
handled  by  performing  certain  bitwise  operations  (such  as  and,  or,  equal)  and  a  form 
of  table  lookup. 
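
The two phases described above, communication of the six link bits and computation by table lookup, can be sketched as follows. This is not the RRM implementation: the direction numbering (0 to 5 around the cell, so that direction d and d + 3 are opposite) is an assumption, and the collision table shown is a placeholder with no particle interactions, since the actual interaction rules are not given in this section.

```c
#include <assert.h>
#include <stdint.h>

#define NDIRS 6

/* Direction opposite to d, with directions numbered 0..5 around the cell
   (this numbering is an assumption of the sketch). */
static int opposite(int d) { return (d + 3) % NDIRS; }

/* Communication phase: neighbor_out[k] is the 6-bit outgoing mask of the
   neighbor lying in direction k. A particle leaving that neighbor in
   direction opposite(k) is headed toward this cell, so it is copied into
   bit opposite(k) of the incoming mask. */
uint8_t gather_incoming(const uint8_t neighbor_out[NDIRS]) {
    uint8_t in = 0;
    for (int k = 0; k < NDIRS; k++) {
        int d = opposite(k);
        if (neighbor_out[k] & (1u << d)) in |= (uint8_t)(1u << d);
    }
    return in;
}

/* Computation phase: table lookup from incoming mask to outgoing mask. */
uint8_t collide(const uint8_t table[64], uint8_t incoming) {
    assert(incoming < 64);
    return table[incoming];
}

/* Placeholder table with no interactions: every particle keeps its
   direction. A real model would replace selected entries with collision
   outcomes that conserve particle number and momentum. */
void fill_free_streaming(uint8_t table[64]) {
    for (int m = 0; m < 64; m++) table[m] = (uint8_t)m;
}
```

Because every cell applies the same gather and the same lookup, the whole update is one uniform step across all cells, which fits the SIMD execution of an ensemble.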

We  used  1000  iterations  of  529  hexagonal  cells  as  the  benchmark.  Assuming  that 
the  ensemble  chips  will  have  a  clock  speed  of  100  MHz,  the  whole  benchmark  should 
run  in  2.2  to  2.6  ms.  There  are  multiple  ways  to  implement  this  problem  in  C  for 
comparison.  The  fastest  implementation  we  developed  (using  register  declarations  for 
variables,  changing  the  way  table  lookup  was  handled,  moving  conditional  expressions 
out  of  the  main  loop)  ran  in  1.4  seconds.  This  results  in  a  Sun-relative  speedup  of 
between  400  and  670  for  a  single  ensemble. 

The  instruction  count  for  the  main  loop  for  this  problem  is  about  220  instructions. 
We  estimate  that  the  communication  overhead  within  a  cluster,  using  neighbor-to- 
neighbor  connections,  could  be  as  low  as  48  clock  cycles  (6  bits  x  4  cells  per  tile  x  2). 
The  transfers  of  marks  between  ensembles  can  take  place  in  12-bit  parallel  transfers 
(one  cell  for  each  tile  on  the  edge  of  the  ensemble).  This  gives  268  clock  cycles  per 
main  loop  or  2680  ns  at  100  MHz.  This  gives  a  cluster-level  performance  that  is  82% 
of  ideal  (=  220/268). 
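
The cluster-level efficiency figures in this section are ratios of computation cycles to total cycles per main loop. A minimal sketch of that calculation follows; the function names are hypothetical, and it assumes that communication is not overlapped with computation, as in the estimate above.

```c
#include <assert.h>

/* Fraction of ideal performance when each main loop of compute_cycles
   incurs comm_cycles of additional, non-overlapped communication. */
double cluster_efficiency(long compute_cycles, long comm_cycles) {
    assert(compute_cycles > 0 && comm_cycles >= 0);
    return (double)compute_cycles / (double)(compute_cycles + comm_cycles);
}

/* Time for one main loop in nanoseconds at a clock rate given in MHz. */
long loop_time_ns(long cycles, long clock_mhz) {
    assert(clock_mhz > 0);
    return cycles * 1000 / clock_mhz;
}
```

For fluid flow, `cluster_efficiency(220, 48)` is about 0.82 and `loop_time_ns(268, 100)` is 2680 ns. The same function applied to the hardware simulator of Section 5.3.5, with 24 instructions and 8 cycles of transfer, gives 0.75.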

5.3.5  Performance  Estimates  for  a  Hardware  Simulator 

It  is  possible  to  do  a  simple  kind  of  hardware  simulation  on  the  RRM  extremely 
fast. The code to simulate two-input NAND and OR gates, where the output state
of  a  gate  is  represented  by  the  status  of  a  specific  mark,  has  only  24  instructions. 
This  simple  simulator  cannot  simulate  arbitrary  circuits,  since  there  can  be  layout 
problems;  gates  must  be  close  to  the  gates  that  produce  their  input  signals.  For  a 
specific  very  simple  circuit,  comparison  of  the  simulations  with  a  highly  optimized 
C program running on a Sun workstation gives a Sun-relative speedup estimate for
an ensemble of 533, and for an optimized C program that is a more general circuit
simulator an estimated speedup of 1,500.

The  cluster-level  performance  estimate  is  close  to  a  linear  scaleup  of  this.  Only 
a  single  mark  per  edge  cell  needs  to  be  transferred  across  the  ensemble  boundaries. 
It seems reasonable to assume that this could be done in 8 clock cycles. The
cluster performance would then be 75% of ideal (= 24/32). The idealized Sun-relative
speedup for a cluster would be 40,000 to 110,000.

5.3.6  Summary  of  Performance  Estimates 

For a 10,000-ensemble RRM, our present estimates are as follows:

•  Raw  peak  performance:  576  trillion  operations  per  second. 

•  For  general  symbolic  applications  (the  numeric  Fibonacci  problem  is  taken  as 
a  typical  example  and  the  TAK  function  is  a  secondary  example): 

- Ensemble Sun-relative speedup is roughly 6.7.

- RRM performance with wormhole network at 88% efficiency gives an
idealized Sun-relative speedup of 59,000.

•  For  highly  regular  symbolic  applications  (the  sorting  problem  is  taken  as  a 
typical  example): 

—  Ensemble  performance  is  a  Sun-relative  speedup  of  127. 

—  Cluster-level  performance  is  a  Sun-relative  speedup  of  114. 

— RRM performance is estimated at over 80% efficiency (relative to the cluster
performance) yielding a Sun-relative speedup of over 91.

•  For  systolic  applications  (a  2D  fluid  flow  problem  is  taken  as  a  typical  example; 
a  secondary  example  is  a  hardware  simulator): 

— Ensemble performance is a Sun-relative speedup of 400 to 670.

— Cluster-level performance, which should be attainable in practice, is 82%
efficiency. This yields idealized Sun-relative speedups of 33,000 to 55,000.


A Code of the Interensemble Simulator and of
Several Benchmarks Run on it


The  abstract  interensemble  simulator  presented  below  was  instrumental 
in  our  initial  explorations  of  the  network  requirements  for  an  RRM  cluster. 
The abstract interensemble simulator essentially consists of an annotated
interpreter of rewrite rules. This interpreter keeps track of which ensemble
a term is supposed to reside in. Given a strategy for spawning tasks, the
simulator then "moves" terms from one ensemble to another (or causes newly
allocated cells to be created in some other ensemble) by annotating the terms
with the name (location) of a new ensemble.
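
The location-annotation idea can be sketched in a few lines of C. This is an illustrative model only, not the abstract simulator's code: the `term` structure, `move_term`, and `cross_ensemble_pointers` are hypothetical names, and terms are assumed to be binary trees without sharing. Moving a term is just a change of annotation, and counting the pointers whose two ends carry different annotations gives one simple measure of the traffic a placement would generate.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical term node: a binary term annotated with the number of
   the simulated ensemble it is supposed to reside in. */
typedef struct term {
    int ensemble;
    struct term *left;
    struct term *right;
} term;

/* "Move" a term by re-annotating it; nothing is copied. */
void move_term(term *t, int new_ensemble)
{
    assert(t != NULL);
    t->ensemble = new_ensemble;
}

/* Count pointers whose source and target live on different ensembles.
   Assumes a tree (no shared subterms, no cycles). */
int cross_ensemble_pointers(const term *t)
{
    int n = 0;
    if (t == NULL)
        return 0;
    if (t->left != NULL)
        n += (t->left->ensemble != t->ensemble)
             + cross_ensemble_pointers(t->left);
    if (t->right != NULL)
        n += (t->right->ensemble != t->ensemble)
             + cross_ensemble_pointers(t->right);
    return n;
}
```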

Many  experiments  were  run  using  this  abstract  simulator.  For  example, 
very  preliminary  experiments  were  conducted  considering  simple  alternative 
cluster  network  configurations  (2-D  mesh,  complete  interconnect,  and  several 
bus-like  connection  layouts).  Experiments  were  conducted  with  alternate 
strategies  for  spawning  new  tasks  (allocating  new  cells  in  other  ensembles 
only  once  one  ensemble  is  nearly  full,  pushing  out  whole  subterms  when  an 
ensemble  becomes  full,  and  static  spawning  routines).  Network  bandwidth 
requirements were estimated based on the number of pointers detected between
terms residing on various simulated ensembles. Some initial experiments
were also carried out regarding alternate formulations of the rewrite
rules.


B Detailed Description of the New Ensemble Simulator


The  New  RISC  Ensemble  Simulator 


Tim  Winkler 


Abstract:  The  purpose  of  these  notes  is  to  document  the  new  RISC  ensemble  simulator. 


1  Introduction 

The rest of these notes is organized as follows:

•  Overview  of  simulator. 

•  Basic  structures. 

•  Basic  simulation  process:  a  simulation  step. 

•  Instruction  set  summary. 

•  Annotated  code  for  operations. 

This  discussion  only  includes  the  basic  simulator. 

2  Overview 


The  ensemble  simulator  is  written  in  C  and  as  a  whole  consists  of 

•  risc.h  —  which  contains  the  basic  parameters,  macro  definitions,  type  definitions,  and  global  variable 
declarations. 

• risc.c — which contains the main routine, simd_execute (which executes the SIMD code; it is called
from the main routine and is defined in the user-provided SIMD code), and the basic SIMD execution
routines (simd, simd1, ...). Supporting the SIMD execution are various routines in basic.c which are not
discussed here. These are all part of the software of the ensemble and do not affect the design of
the ensemble. Normally the user of the simulator will put data graph initialization and the routine
simd_execute in the file simd.c and link this with the rest of the simulator.

• basic.h, basic.c — less basic routines (in fact, these can be viewed as providing SIMD routines) that
are not discussed here.


3  Basic  Structures 


The basic structures used in the simulation are variables that represent registers and wires/buses. The value
of a variable representing a bus indicates the currently asserted state of the bus. Normally the value asserted
should not persist beyond one whole clock cycle, and should only change once during a clock phase (roughly
speaking).

Here  are  some  basic  constants.  From  risc.h: 

/* ensemble configuration */

#define COLNUM 12 — Number of columns of tiles.

#define ROWNUM 12 — Number of rows of tiles.

/* Are row and col numbers powers of two corresponding to ADDRSHIFTs? */

#define POW2 0 — See comment above.

#define TILENUM (COLNUM * ROWNUM) — Total number of tiles.

#define CELLSPERTILE 4 — Number of cells per tile.

#define CELLNUM (TILENUM * CELLSPERTILE) — Total number of cells.

#define PORTNUM (2*TILENUM + COLNUM + ROWNUM) — Number of ports.

#define PORTCNT 4 — Number of ports at cell.
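
As a quick check of how these constants fit together, the sketch below recomputes the port count by enumeration rather than by formula. It assumes (consistently with the port index macros given later in these notes) that every tile contributes a north and a west port, that tiles in the last column add an east port, and that tiles in the last row add a south port. The function `count_ports` is a hypothetical helper, not simulator code.

```c
#include <assert.h>

#define COLNUM 12
#define ROWNUM 12
#define CELLSPERTILE 4
#define TILENUM (COLNUM * ROWNUM)
#define CELLNUM (TILENUM * CELLSPERTILE)
#define PORTNUM (2*TILENUM + COLNUM + ROWNUM)

/* Count port buses tile by tile: two per tile (north, west), plus one
   east port per row on the right edge and one south port per column
   on the bottom edge. */
int count_ports(void)
{
    int r, c, ports = 0;
    for (r = 0; r < ROWNUM; r++)
        for (c = 0; c < COLNUM; c++) {
            ports += 2;
            if (c + 1 == COLNUM) ports++;
            if (r + 1 == ROWNUM) ports++;
        }
    return ports;
}
```

For the 12 x 12 configuration this gives 144 tiles, 576 cells, and 312 ports.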

Here are the central type definitions. Note: the ids are not used in the simulation process and were
included  to  make  structures  self-identifying. 

/* type definitions */

/* only used to get at id (really) */

typedef struct genericstruct { — Any structure below looks like this.
    int genid;
    int genval; /* only because is a common case */
} genericthing;

typedef struct portstruct { — Represents a port bus.
    int portid;
    int portaddr;
    int portval;
    int portloc;
} portbus;

typedef struct colstruct { — Represents a column bus.
    int colid;
    int colval;
    int colowner; /* need not exist in actual hardware */
} colbus;

#define FAILFLAG (1<<30)

typedef struct rowstruct { — Represents a row bus.
    int rowid;
    int rowval;
    int rowowner; /* need not exist in actual hardware */
} rowbus;

typedef struct cellstruct { — Represents a cell.
    int self; /* own address */
    int reg[REGCNT]; — Includes ACC.
    int marks;
    portbus *ports[PORTCNT];
    colbus *col;
    rowbus *row;
    int portno; — Value 0..7; unique for each cell on port bus.
    int refc; — Reference count.
    int aux; — Used in SELCELL, MAKEREQ, FETCHDATA, etc.
    int state; — The state flags (see below).
} cellstate;

/* some defines related to the above */

#define CELLTOK reg[TOK]
#define CELLFLAGS reg[FLAGS]
#define CELLLEFT reg[LEFT]
#define CELLRIGHT reg[RIGHT]
#define CELLNTOK reg[NTOK]
#define CELLNFLAGS reg[NFLAGS]
#define CELLNLEFT reg[NLEFT]
#define CELLNRIGHT reg[NRIGHT]
#define CELLACC reg[ACC]

#define PORTE ports[EAST]
#define PORTN ports[NORTH]
#define PORTW ports[WEST]
#define PORTS ports[SOUTH]

#define STATE_VALID 1
/* VALID means allocated and allowed to become active */
#define STATE_ACTIVE (1<<1)
#define STATE_TRYING (1<<2)
#define STATE_REMOTE_LEFT (1<<5)
#define STATE_REMOTE_RIGHT (1<<6)
#define STATE_COMMIT (1<<7)

typedef struct simdstruct { — SIMD broadcast bus.
    int id;
    int op; — Current operation/opcode.
    int arg1,arg2,arg3,arg4; — Arguments to operation.
    int ph; — Clock phase, 1 or 2.
} simdbus;

There are 32 marks in a cell, referred to by index (0-31). There may be some additional state bits
WORKING, ZERO, MINUS, CARRY, OVERFLOW, and FAILURE. The aux register might be the ACC.

Here  are  the  actual  declarations  of  the  key  shared  variables. 

extern int clock; /* Note: not phase number */
extern int globalfeedback;
extern simdbus simdinstr;
extern portbus port[PORTNUM];
extern colbus col[COLNUM];
extern rowbus row[ROWNUM];
extern cellstate cell[CELLNUM];
int randomstate;


4  The  Simulation  Process 


The user code calls the simd, simd1, ... routines, and also uses the resetglob and testglob routines; these
routines give control to the ensemble. The user SIMD routines may also use whatever controller state is
desired (e.g. flags).

Execution of

simd2(TSTMARKPTR,1,LEFT); SHOW;

causes the opcode (TSTMARKPTR) and arguments (1,LEFT) to be put into the simdinstr structure that
represents the SIMD control broadcast bus, and then the execution of simdstep. There are many extraneous
details to simdstep (e.g., recording frequency of execution of instructions), but the key process is to

• Increment the clock.

• Reset port, row, and col buses (if needed).

• Set the instruction phase to 1 (simdinstr.ph = 1).

• Perform simdaction for every cell.

• In the case of CELLARBPTR and ARBROW, perform some inter-clock phase actions that cannot easily be
associated with cell actions.

• Set the instruction phase to 2 (simdinstr.ph = 2).

• Perform simdaction for every cell.

The simdaction routine takes a pointer to a cell and uses a large switch statement to execute the code
associated with the SIMD op.
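
The two-phase structure of a step can be illustrated with a deliberately tiny model. The sketch below is not the simulator's code: it assumes a single shared bus, four toy cells, and one fixed operation in which marked cells assert a bit in phase 1 and every cell samples the settled bus in phase 2. All names (`toycell`, `toy_step`, and so on) are hypothetical.

```c
#include <assert.h>

#define NCELLS 4

/* Hypothetical toy cell: one mark bit and the bus value it last saw. */
typedef struct { int mark; int seen; } toycell;

toycell toy_cells[NCELLS];
int toy_bus;        /* shared bus; its value is only meaningful within a step */
int toy_clock;      /* counts steps, not phases */
int toy_phase;      /* 1 or 2 */

/* Per-cell action, selected by the current phase. */
void toy_action(toycell *cp, int index)
{
    assert(toy_phase == 1 || toy_phase == 2);
    if (toy_phase == 1) {
        if (cp->mark) toy_bus |= 1 << index;   /* phase 1: assert onto bus */
    } else {
        cp->seen = toy_bus;                    /* phase 2: sample the bus */
    }
}

/* One simulation step: clock tick, bus reset, phase 1 for all cells,
   then phase 2 for all cells. */
void toy_step(void)
{
    int i;
    toy_clock++;
    toy_bus = 0;
    toy_phase = 1;
    for (i = 0; i < NCELLS; i++) toy_action(&toy_cells[i], i);
    toy_phase = 2;
    for (i = 0; i < NCELLS; i++) toy_action(&toy_cells[i], i);
}
```

Because every cell finishes phase 1 before any cell starts phase 2, the result does not depend on the order in which the cells are visited, which is what lets a sequential loop stand in for parallel hardware.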


5  Basic  definitions 


The  following  are  very  basic  bit  operations: 

/* general defines */

— These are bit testing operations based on bit indices.

#define TSTBITN(x,n) ((x)&(1<<(n)))
#define ADDBITN(x,n) ((x) | (1<<(n)))
#define REMBITN(x,n) ((x)& ~(1<<(n)))
#define ASNBITN(x,n,y) ((y)?ADDBITN(x,n):REMBITN(x,n))
#define MSKBITN(n) (1<<(n))
#define SETBITN(x,n) ((x) |= (1<<(n)))
#define CLRBITN(x,n) ((x) &= ~(1<<(n)))
#define LOWERBITS(n) ((1<<(n))-1)

— These are bit testing operations based on bit masks.

#define TSTBIT(x,n) ((x)&(n))
#define ADDBIT(x,n) ((x)|(n))
#define REMBIT(x,n) ((x)&(~(n)))
#define ASNBIT(x,n,y) ((y)?ADDBIT(x,n):REMBIT(x,n))
#define SETBIT(x,n) ((x) |= (n))
#define CLRBIT(x,n) ((x) &= ~(n))

/* Note: duplication of n in following: */
#define TSTALLBIT(x,n) ((n) == ((x)&(n)))
#define LOWESTBIT(n) ((n)&-(n))
#define ATMOSTONEBIT(n) ((n)&~-(n))

#define UNDEF (0xbadbadee)
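
The difference between the index-based and the mask-based macros is easy to get wrong, so a small sketch may help. It restates a subset of the macros as reconstructed above, with one deliberate change that is an assumption of this example and not of the original: the shifts use `1u` so that bit 31 can be handled without signed overflow. The helper `set_and_test_mark` is hypothetical.

```c
#include <assert.h>

/* Index-based forms: n is a bit position (0-31). */
#define TSTBITN(x,n) ((x) & (1u << (n)))
#define SETBITN(x,n) ((x) |= (1u << (n)))
#define CLRBITN(x,n) ((x) &= ~(1u << (n)))
#define MSKBITN(n)   (1u << (n))
#define LOWERBITS(n) ((1u << (n)) - 1)

/* Mask-based forms: n is already a mask. */
#define TSTBIT(x,n)    ((x) & (n))
#define SETBIT(x,n)    ((x) |= (n))
#define CLRBIT(x,n)    ((x) &= ~(n))
#define TSTALLBIT(x,n) ((n) == ((x) & (n)))
#define LOWESTBIT(n)   ((n) & -(n))

/* Set the mark at `index` using the index form, then read it back
   using the mask form; returns 1 if the mark is set. */
int set_and_test_mark(unsigned *marks, int index)
{
    assert(0 <= index && index < 32);
    SETBITN(*marks, index);
    return TSTBIT(*marks, MSKBITN(index)) != 0;
}
```

MSKBITN converts an index into a mask, which is how the simulator code moves between the two families (for example when a cell's portno is turned into a bit on a port bus).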

The  following  are  the  basic  definitions  related  to  addresses,  address  mappings,  port  selection,  and 
PORTNO  calculations. 


/* basic defines */

/* Addresses: top left is 0,0 (is X,Y); X is row and Y is col */
/*      Y=0   1   2   3   4   5   6   7 */
/* X=0    0   1   2   3   4   5   6   7 */
/*   1    8   9  10  11  12  13  14  15 */
/*   2   16  17  18  .... */

#define ADDRSHIFT 4 — Bits to represent column or row.
#define ADDRTOPSHIFT 8
/* should be 2*ADDRSHIFT */

/* Changed cell nums by +1 in addrs */

— Compute address from row, col, cell num.
#define ADDR(x,y,n) ((((n)+1) << ADDRTOPSHIFT) + ((x)<<ADDRSHIFT)+(y))

— Tile is just row,col, no cell number.
#define TILE(x) ((x) & ((1<<ADDRTOPSHIFT)-1))

— Decode addresses.
#define ROW(x) (((x)>>ADDRSHIFT) & ((1<<ADDRSHIFT)-1))
#define COL(x) ((x) & ((1<<ADDRSHIFT)-1))
#define NUM(x) (((x) >> ADDRTOPSHIFT)-1)
#define CELLNUMINCR (1<<ADDRTOPSHIFT)
#define TILEADDR(x,y) (((x)<<ADDRSHIFT)+(y))

— Translating addresses to indices.
#if POW2
#define CELLIND(x,y,n) (((n) << ADDRTOPSHIFT) + ((x)<<ADDRSHIFT)+(y))
/* convert address to cell[] index */
#define ADDRTOIND(x) ((x)-CELLNUMINCR)
#define TILEIND(x,y) (((x)<<ADDRSHIFT) + (y))
#else
#define CELLIND(x,y,n) (((n)*ROWNUM + (x))*COLNUM + (y))
/* convert address to cell[] index */
#define ADDRTOIND(x) ((NUM(x)*ROWNUM + ROW(x))*COLNUM + COL(x))
#define TILEIND(x,y) (((x)*COLNUM)+(y))
#endif

#define ROWINCR (1<<ADDRSHIFT)
#define COLINCR 1
#define ADDRUP(x) ((x)-ROWINCR) — Adjacent tiles.
#define ADDRDWN(x) ((x)+ROWINCR)
#define ADDRLFT(x) ((x)-COLINCR)
#define ADDRRGT(x) ((x)+COLINCR)

#define CELLAT(a) cell[ADDRTOIND(a)] — Address to cell.
#define CELLREF(x,y,n) cell[CELLIND(x,y,n)] — r,c,n to cell.
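
A worked example may make the address layout concrete. The sketch below repeats the encoding and decoding macros as reconstructed above (non-POW2 case, 12 x 12 tiles) and packs the cell at row 3, column 5, cell number 2. The helper `roundtrip_ok` is hypothetical and exists only to exercise the macros.

```c
#include <assert.h>

#define COLNUM 12
#define ROWNUM 12
#define ADDRSHIFT 4
#define ADDRTOPSHIFT 8

#define ADDR(x,y,n) ((((n)+1) << ADDRTOPSHIFT) + ((x) << ADDRSHIFT) + (y))
#define TILE(x) ((x) & ((1 << ADDRTOPSHIFT) - 1))
#define ROW(x)  (((x) >> ADDRSHIFT) & ((1 << ADDRSHIFT) - 1))
#define COL(x)  ((x) & ((1 << ADDRSHIFT) - 1))
#define NUM(x)  (((x) >> ADDRTOPSHIFT) - 1)
#define ADDRTOIND(x) ((NUM(x)*ROWNUM + ROW(x))*COLNUM + COL(x))

/* Returns 1 if decoding ADDR(row,col,n) gives back row, col and n. */
int roundtrip_ok(int row, int col, int n)
{
    int a = ADDR(row, col, n);
    assert(0 <= row && row < ROWNUM && 0 <= col && col < COLNUM);
    return ROW(a) == row && COL(a) == col && NUM(a) == n;
}
```

For row 3, column 5, cell 2 the address is (3 << 8) + (3 << 4) + 5 = 821; its tile part is 53 and its index into the cell array is (2*12 + 3)*12 + 5 = 329. Because the cell number is stored with an offset of one, no valid cell address is smaller than 256.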

— Indices for ports.

#define PORTNIND(x,y) (2*TILEIND(x,y))
#define PORTWIND(x,y) (2*TILEIND(x,y)+1)
#define ISPORTW(x) ((x)&1)
/* only when not on right or bottom edge */
#define PORTEIND(x) (2*TILENUM+(x))
/* along right side */
#define PORTSIND(y) (2*TILENUM+ROWNUM+(y))
/* along bottom */

/* access to all ports by allowing "the next index over" */
#define PORTNINDX(x,y) ((x)==ROWNUM ? PORTSIND(y) : PORTNIND(x,y))
#define PORTWINDX(x,y) ((y)==COLNUM ? PORTEIND(x) : PORTWIND(x,y))

/* not really correct: */
#if POW2
#define VALIDADDR(x) (CELLNUMINCR<=(x) && \
    (x)<(CELLNUM+CELLNUMINCR))
#else
#define VALIDADDR(x) (ADDR(0,0,0)<=(x) && \
    (x)<ADDR(ROWNUM-1,COLNUM-1,CELLSPERTILE-1) && \
    (COL(x)<COLNUM) && (ROW(x)<ROWNUM))
#endif

#define VALIDDIR(x) (0<=(x) && (x)<= 3)

/* a pair of tileaddrs */
/* Try to treat addr as local when same tile? */

#define ISLOCAL(x,y) (((x)<=(y))? ((COLINCR==((y)-(x)) && COL(y)!=0) \
    || ROWINCR==((y)-(x))) : \
    ((COLINCR==((x)-(y)) && COL(x)!=0) || ROWINCR==((x)-(y))))

— Compute direction of y from x (or undef).

#define WHICHDIR(x,y) \
    ((COLINCR==((y)-(x)))?((COL(y)==0)? UNDEF : EAST): \
    (ROWINCR==((y)-(x)))?SOUTH: \
    (-COLINCR==((y)-(x)))?((COL(x)==0)? UNDEF : WEST): \
    (-ROWINCR==((y)-(x)))?NORTH: UNDEF)
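
The direction computation is compact enough that a function form may be easier to follow. The sketch below re-expresses the WHICHDIR logic as reconstructed above; it is an illustration, not the simulator's macro. It uses a hypothetical `NODIR` value of -1 in place of UNDEF, and the names `which_dir` and `NODIR` are assumptions of this example.

```c
#include <assert.h>

#define ADDRSHIFT 4
#define ROWINCR (1 << ADDRSHIFT)
#define COLINCR 1
#define COL(x) ((x) & ((1 << ADDRSHIFT) - 1))
#define TILEADDR(x,y) (((x) << ADDRSHIFT) + (y))

enum { EAST = 0, NORTH = 1, WEST = 2, SOUTH = 3, NODIR = -1 };

/* Direction of tile address `to` as seen from tile address `from`,
   or NODIR if the two tiles are not adjacent. */
int which_dir(int from, int to)
{
    int d = to - from;
    assert(from >= 0 && to >= 0);
    if (d == COLINCR)              /* next column: east, unless it wrapped */
        return COL(to) == 0 ? NODIR : EAST;
    if (d == ROWINCR)
        return SOUTH;
    if (d == -COLINCR)             /* previous column: west, unless it wrapped */
        return COL(from) == 0 ? NODIR : WEST;
    if (d == -ROWINCR)
        return NORTH;
    return NODIR;
}
```

The two COL checks matter because tile addresses are packed integers: an address difference of 1 can also arise between a tile whose column field is 15 and the first tile of the next row, and those two tiles are not neighbours.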
/* addr -> portno */

#define PORTNO(a) ((1&((a)^((a)>>ADDRSHIFT)))?NUM(a)+4:NUM(a))

#define PORTNOSHIFT 8
#define PORTNOCNT 8

#define PORTNOCASE(a) ((a)<CELLSPERTILE)

#define PORTNOSELBIG(a,b) (((a)<CELLSPERTILE) ? ((b)>>PORTNOSHIFT) : \
    ((b)&LOWERBITS(PORTNOSHIFT)))

#define PORTNOSEL(a,b) ((a) ? ((b)>>CELLSPERTILE) : \
    ((b)&LOWERBITS(CELLSPERTILE)))

#define PORTNOREPLY(a,b) (((a)<CELLSPERTILE) ? ((b)<<CELLSPERTILE) : (b))
/* ROW and COL work for PORTADDRs too */

#define PORTADDR(x,y,d) (((d) << ADDRTOPSHIFT) + ((x)<<ADDRSHIFT)+(y))
#define PORTDIR(x) ((x) >> ADDRTOPSHIFT)

/* number of registers in one set */

#define REGNUM 4
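
The port number assignment gives each of the eight cells that share a port bus its own wire: tiles are coloured like a checkerboard, cells in one colour use port numbers 0-3 and cells in the other colour use 4-7. The sketch below checks that the address-based PORTNO macro (as reconstructed above) agrees with the positional rule used at initialization. The helper functions are hypothetical names introduced for this illustration.

```c
#include <assert.h>

#define ADDRSHIFT 4
#define ADDRTOPSHIFT 8
#define CELLSPERTILE 4
#define ADDR(x,y,n) ((((n)+1) << ADDRTOPSHIFT) + ((x) << ADDRSHIFT) + (y))
#define NUM(x) (((x) >> ADDRTOPSHIFT) - 1)
#define PORTNO(a) ((1 & ((a) ^ ((a) >> ADDRSHIFT))) ? NUM(a) + 4 : NUM(a))

/* Port number derived from position: cell number, plus CELLSPERTILE
   when row + col is odd (the checkerboard colouring). */
int portno_from_position(int row, int col, int n)
{
    assert(0 <= n && n < CELLSPERTILE);
    return n + (((row + col) & 1) ? CELLSPERTILE : 0);
}

/* Returns 1 if PORTNO(address) matches the positional rule for every
   cell of a rows x cols grid (rows, cols at most 16). */
int portno_rules_agree(int rows, int cols)
{
    int i, j, n;
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            for (n = 0; n < CELLSPERTILE; n++)
                if (PORTNO(ADDR(i, j, n)) != portno_from_position(i, j, n))
                    return 0;
    return 1;
}
```

The macro works on the packed address because bit 0 of the address is the low bit of the column and bit 0 of the address shifted right by ADDRSHIFT is the low bit of the row, so their exclusive-or is the parity of row + col.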

These  are  some  simple  defines  used  to  make  the  SIMD  code  a  bit  more  readable. 


/* registers */

#define TOK 0
#define FLAGS 1
#define LEFT 2
#define RIGHT 3
#define NTOK 4
#define NFLAGS 5
#define NLEFT 6
#define NRIGHT 7
#define ACC 8
#define REGCNT 9

/* directions */

#define EAST 0
#define NORTH 1
#define WEST 2
#define SOUTH 3

In fact, it is possible to use 16 registers by changing REGCNT to 16. One may also want to adjust REGNUM.


6  Instruction  Set  Summary 

Specifiers  for  arguments  to  SIMD  instructions: 

•  val:  16  (actually  32  bit)  value 

•  mark:  0-31  bit  position 

•  reg,  ptr,  source-ptr,  target-ptr,  sreg,  treg,  source,  target:  0-8  register  selector 
May be 0-15 instead.

• dir: 0=EAST, 1=NORTH, 2=WEST, 3=SOUTH

• portno: 0-7 select active element on port

Basic instructions


NOP — Do nothing.
INIT — Activate all allocated.
ERASE — Clear marks.
CONST val — val to ACC.
CLEAR reg — 0 to reg.

EQ reg reg — Various tests.
NEQ reg reg
GT reg reg
LE reg reg
LT reg reg
GE reg reg
TSTZERO reg
TSTNZERO reg

MOVE treg sreg — From source to target.
ADD treg sreg
SUB treg sreg
LOGNOT reg
LOGAND treg sreg
LOGIOR treg sreg
LOGXOR treg sreg

(For these next, probably want to use a state bit for bits shifted out)
ROTATEL treg sreg — 16-bit rotate left.
ROTATER treg sreg — 16-bit rotate right.
SHIFTL treg sreg — Shift left.
SHIFTR treg sreg — Shift right.
SHIFTRA treg sreg — Shift right arithmetic (preserve sign).

SETRANDOM treg — Put in 16 (32) bit pseudo-random value.

TSTMARK mark — Local mark operations.
TSTNOTMARK mark
SETMARK mark
CLRMARK mark

TSTMARKPTR mark source-ptr — Adjacent tile mark operations.
TSTNOTMARKPTR mark source-ptr
SETMARKPTR mark target-ptr
CLRMARKPTR mark target-ptr

INCR val — Increment reference count of active cell;
    val can be +1,-1.
INCRPTR1 val target-ptr — RISC incrptr part 1 (cells 0,1).
INCRPTR2 val target-ptr — RISC incrptr part 2 (cells 2,3).

SETTRYING — Set state trying flag.
    Performs global feedback on any active.
INITTRYING — Equiv. to: INIT, test TRYING flag.
    Performs global feedback on any active.
CLRTRYING — Clear state trying flag.

ARBPORTPTR reg — Arbitrate for port based on pointer,
    clear TRYING if succeed (deactivate if non-local).
ARBPORT dir — Arbitrate on given port; clear TRYING if succeed.

MAKEREQ ptr — RISC fetch part 1; uses aux.
    May want a version, MAKEREQDIR dir, that goes in a certain direction
    (not defined yet).
FETCHDATA treg source-ptr sreg — RISC fetch part 2.
STORE target-ptr treg sreg — RISC store (assumes arbitration).
    [SHOULD BE SPLIT INTO MAKEREQ and STOREDATA.]

STOREPN portno target-ptr treg sreg — Portno activation, no arb. needed.
FETCHPN portno treg source-ptr sreg — Portno activation, no arb. needed.

CELLARBPTR target-ptr — After ARBPORTPTR, do cell arbitration (not used).

ALLOCREQ — RISC alloc part 1; uses aux.
AVAIL1 reg — RISC alloc part 2 (cells 0,1).
AVAIL2 reg — RISC alloc part 3 (cells 2,3).
ALLOCPTR target-ptr — RISC alloc part 4; allocate pointed-at cell.

SENDGLOBAL — Initiate global feedback.
resetglob() — CONTROLLER CODE: reset global feedback indicator.
testglob() — CONTROLLER CODE: test global feedback indicator.

ROW/COL operations

ARBTILE — ROW/COL part 1; equiv. to ARBPORT NORTH, but doesn't affect TRYING.
ARBCOL — ROW/COL part 2.
SELROW target — ROW/COL part 3.
ARBROW target — ROW/COL part 4.
SELCELL ptr — ROW/COL part 5.

FETCHGLBL target source-ptr reg — Global actions.
STOREGLBL target-ptr reg source
TSTMARKGLBL source-ptr mark
TSTNOTMARKGLBL mark source-ptr
SETMARKGLBL mark target-ptr
CLRMARKGLBL mark target-ptr
INCRGLBL val target-ptr

Other  instructions 

SWAP — Switch TOK, NTOK, etc.
COMMIT — SWAP; set COMMIT flag.
COMMITMARK mark — If mark, then SWAP; always set COMMIT flag.
SETCOMMIT — Set COMMIT flag.

TSTSTATE flags — Operations on state flags (use masks not bit positions).
TSTNOTSTATE flags
SETSTATE flags
CLRSTATE flags

TSTLOCAL ptr — Test if local pointer.
TSTREMOTE ptr — Test if remote pointer.
TSTADDR ptr — Test if appears to be an address.
TSTNOTADDR ptr — Test if appears not to be an address.

TSTINUSE — Test if reference count is positive (>=1).
TSTUNUSED — Test if reference count is not positive (<=0).
TSTUNSHARED — Test if reference count is 1.

RELOCREQ ptr — If local, succeed, otherwise start relocation.
    (Set REMOTE flags.)

DELETE — Clear registers, marks, and state.
INITALL — Init all including unallocated.

TSTBLACK — Checkerboard activation.
TSTRED — Checkerboard activation.
TSTCLASS class — One of 5 cases activation, class: 0-4.
TSTCELLNUM cellnum — Activate based on cell number: 0-3.
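
Most of these instructions follow one pattern: an initializing instruction activates cells, test instructions deactivate the active cells that fail the test, and action instructions affect only the cells still active. The sketch below models that pattern for three instructions. It is a simplified illustration under stated assumptions (a flat array of toy cells, no clock phases, hypothetical names such as `toy_broadcast`), not the simulator's implementation.

```c
#include <assert.h>

#define TOY_VALID  1u
#define TOY_ACTIVE 2u

/* Hypothetical toy cell with state flags and 32 mark bits. */
typedef struct { unsigned state; unsigned marks; } toy_simd_cell;

enum toy_op { TOY_INIT, TOY_TSTMARK, TOY_SETMARK };

/* Broadcast one instruction to every cell.
   TOY_INIT:    activate all allocated (valid) cells.
   TOY_TSTMARK: deactivate active cells whose mark is clear.
   TOY_SETMARK: set the mark on cells that are still active. */
void toy_broadcast(toy_simd_cell *cells, int n, enum toy_op op, int mark)
{
    int i;
    assert(0 <= mark && mark < 32);
    for (i = 0; i < n; i++) {
        toy_simd_cell *cp = &cells[i];
        switch (op) {
        case TOY_INIT:
            if (cp->state & TOY_VALID) cp->state |= TOY_ACTIVE;
            break;
        case TOY_TSTMARK:
            if ((cp->state & TOY_ACTIVE) && !(cp->marks & (1u << mark)))
                cp->state &= ~TOY_ACTIVE;
            break;
        case TOY_SETMARK:
            if (cp->state & TOY_ACTIVE) cp->marks |= 1u << mark;
            break;
        }
    }
}
```

A sequence INIT, TSTMARK 0, SETMARK 5 therefore reads as "for every allocated cell with mark 0 set, set mark 5", which is how conditional behaviour is expressed without per-cell branching.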


7  Annotated  Code 


The  following  is  the  key  simulator  code  with  some  annotations. 

This  is  the  initialization  routine.  The  ids  of  objects  are  filled  in  (although  these  are  not  currently  used). 
The  central  function  is  to  link  up  the  cells  with  the  port  buses. 

ensemble_init()
{
    register int i,j,n,a;
    cellstate *cp;

    simdinstr.id = 0;
    simdinstr.op = NOP;
    simdinstr.arg1 = UNDEF;
    simdinstr.arg2 = UNDEF;
    simdinstr.arg3 = UNDEF;
    simdinstr.arg4 = UNDEF;

    for (i=0; i<PORTNUM; i++) {
        port[i].portid = MKID(PORT,i);
        port[i].portloc = 0;
    }

    for (i=0; i<ROWNUM; i++) row[i].rowid = MKID(ROWS,i);
    for (i=0; i<COLNUM; i++) col[i].colid = MKID(COLS,i);

    for (i=0; i<ROWNUM; i++)
        for (j=0; j<COLNUM; j++) {
            for (n=0; n<CELLSPERTILE; n++) {
                a = ADDR(i,j,n);
                cp = &CELLAT(a);
                cp->self = MKID(CELL,a);
                /* used when all cells on a port need to use a bit, etc. */
                cp->portno = n + ((1&(i+j)) ? CELLSPERTILE : 0);
                cp->row = &row[i]; — Row and col are easy.
                cp->col = &col[j];
                — Ports are more complex.
                cp->PORTN = &port[PORTNINDX(i,j)];
                cp->PORTN->portloc = PORTADDR(i,j,NORTH);
                cp->PORTW = &port[PORTWINDX(i,j)];
                cp->PORTW->portloc = PORTADDR(i,j,WEST);
                cp->PORTE = &port[PORTWINDX(i,j+1)];
                if (j+1 == COLNUM) cp->PORTE->portloc = PORTADDR(i,j,EAST);
                cp->PORTS = &port[PORTNINDX(i+1,j)];
                if (i+1 == ROWNUM) cp->PORTS->portloc = PORTADDR(i,j,SOUTH);
                cp->marks = 0;
                cp->refc = 0;
                cp->state = 0;
                /* { int n; for (n=0; n<REGCNT; n++) cp->reg[n] = 0; } */
            }
        }
}

Here are the basic SIMD broadcast routines. They simply copy their arguments to the simdinstr structure. (Would
have liked a somewhat different approach to variable numbers of arguments than provided in C.)

simd(op)
int op;
{
    simdinstr.op = op;
    simdinstr.arg1 = UNDEF;
    simdinstr.arg2 = UNDEF;
    simdinstr.arg3 = UNDEF;
    simdinstr.arg4 = UNDEF;
    simdstep();
}

simd1(op,arg)
int op,arg;
{
    simdinstr.op = op;
    simdinstr.arg1 = arg;
    simdinstr.arg2 = UNDEF;
    simdinstr.arg3 = UNDEF;
    simdinstr.arg4 = UNDEF;
    simdstep();
}

simd2(op,arg1,arg2)
int op,arg1,arg2;
{
    simdinstr.op = op;
    simdinstr.arg1 = arg1;
    simdinstr.arg2 = arg2;
    simdinstr.arg3 = UNDEF;
    simdinstr.arg4 = UNDEF;
    simdstep();
}

simd3(op,arg1,arg2,arg3)
int op,arg1,arg2,arg3;
{
    simdinstr.op = op;
    simdinstr.arg1 = arg1;
    simdinstr.arg2 = arg2;
    simdinstr.arg3 = arg3;
    simdinstr.arg4 = UNDEF;
    simdstep();
}

simd4(op,arg1,arg2,arg3,arg4)
int op,arg1,arg2,arg3,arg4;
{
    simdinstr.op = op;
    simdinstr.arg1 = arg1;
    simdinstr.arg2 = arg2;
    simdinstr.arg3 = arg3;
    simdinstr.arg4 = arg4;
    simdstep();
}
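
One alternative to a family of fixed-arity broadcast routines is a single variadic routine. The sketch below uses the standard `<stdarg.h>` facility from ANSI C; it is offered only as an illustration of that alternative and is not how the simulator is written. The names `simdn`, `toy_instr`, and `TOY_UNDEF` are hypothetical, and the call to the step routine is left as a comment.

```c
#include <assert.h>
#include <stdarg.h>

#define TOY_MAXARGS 4
#define TOY_UNDEF (-1)   /* stand-in for the simulator's UNDEF value */

/* Hypothetical broadcast record: opcode plus up to four arguments. */
typedef struct { int op; int arg[TOY_MAXARGS]; } toy_instr;

toy_instr toy_last_instr;

/* Broadcast `op` with `nargs` integer arguments; unused argument
   slots are filled with TOY_UNDEF. */
void simdn(int op, int nargs, ...)
{
    va_list ap;
    int i;
    assert(0 <= nargs && nargs <= TOY_MAXARGS);
    toy_last_instr.op = op;
    va_start(ap, nargs);
    for (i = 0; i < TOY_MAXARGS; i++)
        toy_last_instr.arg[i] = (i < nargs) ? va_arg(ap, int) : TOY_UNDEF;
    va_end(ap);
    /* a real version would now run one simulation step */
}
```

The cost of this design is that the caller must pass the argument count explicitly, and the compiler cannot check that the count matches the arguments supplied, which the fixed-arity routines avoid.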


This is the core SIMD execution routine (the process of execution has already been briefly discussed).
The important state changes involve clock, simdinstr.ph, and the communication buses.

simdstep()
{
    register int i,op,p,flg,val;

    clock++; — Update clock.
    op = simdinstr.op; — Get op.
    if (recordinstrfreq) {
        if (op != INIT) instrfreq[op]++;
    }
    if (showinstrs) {
        printf("%d: ",clock);
        printop(); printf("\n"); fflush(stdout); }
    if (op != NOP) {
        /* Actually there seem to be cases here */
        /* arbcol, arbrow are actions over different entities */
        /* Could improve on this */
        if (op==ARBCOL) — Conditionally reset row/col buses.
            for (i=0; i<COLNUM; i++) { col[i].colval = 0; col[i].colowner = 0; }
        else
        if (op==SELROW)
            for (i=0; i<COLNUM; i++) col[i].colval = UNDEF;
        else
        if (op==SELCELL)
            for (i=0; i<CELLNUM; i++) cell[i].aux = 0;
        — Always reset port buses.
        for (i=0; i<PORTNUM; i++) port[i].portval = 0;
        for (i=0; i<PORTNUM; i++) port[i].portaddr = UNDEF;

        /* if (verbose) printf("phase1\n"); */
        simdinstr.ph = 1; — Perform phase 1, all cells.
        for (i=0; i<CELLNUM; i++)
            simdaction(&cell[i]);

        /* if (verbose) printf("port\n"); */
        /* In the following could put checks to see if bus already non-UNDEF
           which would indicate an error in use; this also ignores issues
           of cell arb */
        if (op == FETCH) { — Not currently used.
            for (i=0; i<PORTNUM; i++)
                if (port[i].portaddr != UNDEF)
                    port[i].portval = CELLAT(port[i].portaddr).reg[simdinstr.arg3];
        }
        else
        if (op == CELLARBPTR) {
            for (i=0; i<CELLNUM; i++) {
                flg = 0;
                val = MSKBITN(cell[i].portno);
                for (p=0; p<PORTCNT; p++) {
                    if (TSTBIT(cell[i].ports[p]->portval,val)) {
                        if (flg) CLRBIT(cell[i].ports[p]->portval,val);
                        flg = 1;
                    }
                }
            }
        }
        else
        if (op == ARBROW) {
            for (i=0; i<ROWNUM; i++) { row[i].rowval = UNDEF; row[i].rowowner = 0; }
            for (p=COLNUM-1; 0<=p; p--) {
                if (0 != col[p].colowner) {
                    val = col[p].colval;
                    if (val != UNDEF) {
                        i = row[val].rowowner;
                        if (i != 0) SETBIT(col[COL(i)].colval,FAILFLAG);
                        row[val].rowowner = col[p].colowner;
                    }
                }
            }
        }

        /* if (verbose) printf("phase2\n"); */
        simdinstr.ph = 2; — Perform phase 2, all cells.
        for (i=0; i<CELLNUM; i++)
            simdaction(&cell[i]);
    }
}


This routine, simdaction, is the heart of it all. It gets a pointer to a cell and uses the clock phase and
the SIMD broadcast to perform the appropriate action on the cell and buses. Here is the prolog. The large
switch statement is discussed below. Note the "local" macros that are introduced.

simdaction(cp)
register cellstate *cp;
{
    register int op,ph,st,r,val;
    int dir,tile,mask,a,flg;

    /* Do some top-level case analysis depending on whether active, etc. */
    st = cp->state;
#define DEACTIVATE CLRBIT(cp->state,STATE_ACTIVE)
#define ACTIVATE SETBIT(cp->state,STATE_ACTIVE)
    op = simdinstr.op;
    ph = simdinstr.ph;
#define IS_ACTIVE TSTBIT(st,STATE_ACTIVE)
#define IS_VALID TSTBIT(st,STATE_VALID)

    switch (op) {

    — All the cases are discussed below.

Some cases in the switch are omitted. The various operations are discussed below.

Here  is  a  large  group  of  relatively  trivial  operations  that  are  not  discussed  in  detail.  These  instructions 
consist  of  simple  local  operations  and  tests. 

case NOP: break;
case INIT:
    if (ph==2 && TSTBIT(st,STATE_VALID)) ACTIVATE;
    break;
case ERASE:
    if (ph==2 && IS_ACTIVE) cp->marks = 0;
    break;

case TSTMARK: /* mark */
    if (ph==2 && IS_ACTIVE && !TSTBITN(cp->marks,simdinstr.arg1))
        DEACTIVATE;
    break;

case TSTNOTMARK: /* mark */
    if (ph==2 && IS_ACTIVE && TSTBITN(cp->marks,simdinstr.arg1))
        DEACTIVATE;
    break;

case SETMARK: /* mark */
    if (ph==2 && IS_ACTIVE)
        SETBITN(cp->marks,simdinstr.arg1);
    break;

case CLRMARK: /* mark */
    if (ph==2 && IS_ACTIVE)
        CLRBITN(cp->marks,simdinstr.arg1);
    break;

case CONST: /* val */
    if (ph==2 && IS_ACTIVE)
        cp->CELLACC = simdinstr.arg1;
    break;

case CLEAR: /* reg */
    if (ph==2 && IS_ACTIVE)
        cp->reg[simdinstr.arg1] = 0;
    break;

case EQ: /* reg reg */
    if (ph==2 && IS_ACTIVE &&
        !((cp->reg[simdinstr.arg1])==(cp->reg[simdinstr.arg2])))
        DEACTIVATE;
    break;

case NEQ: /* reg reg */
    if (ph==2 && IS_ACTIVE &&
        !((cp->reg[simdinstr.arg1])!=(cp->reg[simdinstr.arg2])))
        DEACTIVATE;
    break;

case GT: /* reg reg */
    if (ph==2 && IS_ACTIVE &&
        !((cp->reg[simdinstr.arg1])>(cp->reg[simdinstr.arg2])))
        DEACTIVATE;
    break;

case LE: /* reg reg */
    if (ph==2 && IS_ACTIVE &&
        !((cp->reg[simdinstr.arg1])<=(cp->reg[simdinstr.arg2])))
        DEACTIVATE;
    break;

case LT: /* reg reg */
    if (ph==2 && IS_ACTIVE &&
        !((cp->reg[simdinstr.arg1])<(cp->reg[simdinstr.arg2])))
        DEACTIVATE;
    break;

case GE: /* reg reg */
    if (ph==2 && IS_ACTIVE &&
        !((cp->reg[simdinstr.arg1])>=(cp->reg[simdinstr.arg2])))
        DEACTIVATE;
    break;

case TSTZERO: /* reg */
    if (ph==2 && IS_ACTIVE && !((cp->reg[simdinstr.arg1])==0))
        DEACTIVATE;
    break;

case TSTNZERO: /* reg */
    if (ph==2 && IS_ACTIVE && ((cp->reg[simdinstr.arg1])==0))
        DEACTIVATE;
    break;
case MOVE:
    if (ph==2 && IS_ACTIVE)
        cp->reg[simdinstr.arg1] = cp->reg[simdinstr.arg2];
    break;

case ADD: /* treg, sreg */
    if (ph==2 && IS_ACTIVE)
        cp->reg[simdinstr.arg1] += cp->reg[simdinstr.arg2];
    break;

case SUB: /* treg, sreg */
    if (ph==2 && IS_ACTIVE)
        cp->reg[simdinstr.arg1] -= cp->reg[simdinstr.arg2];
    break;

case LOGNOT: /* reg */
    if (ph==2 && IS_ACTIVE)
        cp->reg[simdinstr.arg1] = ~cp->reg[simdinstr.arg1];
    break;

case LOGAND: /* treg, sreg */
    if (ph==2 && IS_ACTIVE)
        cp->reg[simdinstr.arg1] &= cp->reg[simdinstr.arg2];
    break;

case LOGIOR: /* treg, sreg */
    if (ph==2 && IS_ACTIVE)
        cp->reg[simdinstr.arg1] |= cp->reg[simdinstr.arg2];
    break;

case LOGXOR: /* treg sreg */
    if (ph==2 && IS_ACTIVE)
        cp->reg[simdinstr.arg1] ^= cp->reg[simdinstr.arg2];
    break;

case ROTATEL: /* treg sreg */
    if (ph==2 && IS_ACTIVE) {
        /* This is specifically a 16 bit operation and doesn't use
           a condition flag */
        val = cp->reg[simdinstr.arg2];
        val = ((val << 1) & 0xffff) + (val & 0x8000 ? 1 : 0);
        cp->reg[simdinstr.arg1] = val;
    }
    break;

case ROTATER: /* treg sreg */
    if (ph==2 && IS_ACTIVE) {
        /* This is specifically a 16 bit operation and doesn't use
           a condition flag */
        val = cp->reg[simdinstr.arg2];
        val = ((val >> 1) & 0x7fff) + (val & 1 ? 0x8000 : 0);
        cp->reg[simdinstr.arg1] = val;
    }
    break;

case SHIFTL: /* treg sreg */
    if (ph==2 && IS_ACTIVE)
        cp->reg[simdinstr.arg1] = (cp->reg[simdinstr.arg2]) << 1;
    break;

case SHIFTR: /* treg sreg */
    if (ph==2 && IS_ACTIVE)
        cp->reg[simdinstr.arg1] = ((unsigned int)(cp->reg[simdinstr.arg2])) >> 1;
    break;

case SHIFTRA: /* treg sreg */
    if (ph==2 && IS_ACTIVE)
        /* This is probably not very portable, but produces the
           desired effect on a SPARC */
        cp->reg[simdinstr.arg1] = (cp->reg[simdinstr.arg2]) >> 1;
    break;
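
The two rotate cases treat the register as a 16-bit quantity held in a wider int. The sketch below isolates that arithmetic in two small functions so it can be checked by hand. The function names are hypothetical, and the input is assumed to lie in the range 0 to 0xffff.

```c
#include <assert.h>

/* 16-bit rotate left by one: the bit shifted out of position 15
   re-enters at position 0; anything above bit 15 is discarded. */
int rotl16(int val)
{
    assert(0 <= val && val <= 0xffff);
    return ((val << 1) & 0xffff) + ((val & 0x8000) ? 1 : 0);
}

/* 16-bit rotate right by one: the bit shifted out of position 0
   re-enters at position 15. */
int rotr16(int val)
{
    assert(0 <= val && val <= 0xffff);
    return ((val >> 1) & 0x7fff) + ((val & 1) ? 0x8000 : 0);
}
```

For example, rotating 0x8001 left gives 0x0003, and rotating 0x0001 right gives 0x8000; rotating left and then right returns the original value.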

The following operations make key use of the PORTNO assignment for cells which allocates a unique
wire to each cell on a port bus.

— If mark is set on pointed at cell, stay active.
case TSTMARKPTR: /* mark, source-ptr */
    /* Note: the following is done by all cells */
    if (ph==1) {
        if (IS_VALID && TSTBITN(cp->marks,simdinstr.arg1)) {
            val = MSKBITN(cp->portno);
            for (r=0; r<PORTCNT; r++) SETBIT(cp->ports[r]->portval,val);
        }
    }
    else { /* ph==2 */
        if (IS_ACTIVE) {
            if (r = cp->reg[simdinstr.arg2],VALIDADDR(r)) {
                tile = TILE(r);
                dir = WHICHDIR(TILE(cp->self),tile);
                if (dir == UNDEF) {
                    if (simdinstr.arg2==LEFT) SETBIT(cp->state,STATE_REMOTE_LEFT); else
                    if (simdinstr.arg2==RIGHT) SETBIT(cp->state,STATE_REMOTE_RIGHT);
                    DEACTIVATE;
                } else {
                    val = PORTNO(r);
                    if (!TSTBITN(cp->ports[dir]->portval,val))
                        DEACTIVATE;
                }
            } else DEACTIVATE;
        }
    }
    break;

/* If mark is clear on pointed-at cell, stay active. */
case TSTNOTMARKPTR: /* mark, source-ptr */
    /* Almost identical to the above */
    /* Note: the following is done by all cells */
    if (ph==1) {
        if (IS_VALID && TSTBITN(cp->marks, simdinstr.arg1)) {
            val = MSKBITN(cp->portno);
            for (r=0; r<PORTCNT; r++) SETBIT(cp->ports[r]->portval, val);
        }
    }
    else { /* ph==2 */
        if (IS_ACTIVE) {
            if (r = cp->reg[simdinstr.arg2], VALIDADDR(r)) {
                tile = TILE(r);
                dir = WHICHDIR(TILE(cp->self), tile);
                if (dir == UNDEF) {
                    if (simdinstr.arg2==LEFT) SETBIT(cp->state, STATE_REMOTE_LEFT); else
                    if (simdinstr.arg2==RIGHT) SETBIT(cp->state, STATE_REMOTE_RIGHT);
                    DEACTIVATE;
                } else {
                    val = PORTNO(r);
                    if (TSTBITN(cp->ports[dir]->portval, val)) /* no ! */
                        DEACTIVATE;
                }
            } else DEACTIVATE;
        }
    }
    break;
/* Set mark on pointed-at cell. */
case SETMARKPTR: /* mark, target-ptr */
    if (ph==1) {
        if (IS_ACTIVE) {
            if (r = cp->reg[simdinstr.arg2], VALIDADDR(r)) {
                dir = WHICHDIR(TILE(cp->self), TILE(r));
                if (UNDEF == dir) {
                    if (simdinstr.arg2==LEFT) SETBIT(cp->state, STATE_REMOTE_LEFT); else
                    if (simdinstr.arg2==RIGHT) SETBIT(cp->state, STATE_REMOTE_RIGHT);
                    DEACTIVATE;
                } else {
                    SETBIT(cp->ports[dir]->portval, MSKBITN(PORTNO(r)));
                }
            } else DEACTIVATE;
        }
    } else { /* ph==2 */
        if (IS_VALID) {
            val = MSKBITN(simdinstr.arg1);
            if (!(TSTBIT(cp->marks, val))) {
                mask = val;
                val = MSKBITN(cp->portno);
                for (r=0; r<PORTCNT; r++)
                    if (TSTBIT(cp->ports[r]->portval, val)) {
                        SETBIT(cp->marks, mask);
                        break;
                    }
            }
        }
    }
    break;

/* Clear mark on pointed-at cell. */
case CLRMARKPTR: /* mark, target-ptr */
    /* Almost identical to the above */
    if (ph==1) {
        if (IS_ACTIVE) {
            if (r = cp->reg[simdinstr.arg2], VALIDADDR(r)) {
                dir = WHICHDIR(TILE(cp->self), TILE(r));
                if (UNDEF == dir) {
                    if (simdinstr.arg2==LEFT) SETBIT(cp->state, STATE_REMOTE_LEFT); else
                    if (simdinstr.arg2==RIGHT) SETBIT(cp->state, STATE_REMOTE_RIGHT);
                    DEACTIVATE;
                } else {
                    SETBIT(cp->ports[dir]->portval, MSKBITN(PORTNO(r)));
                }
            } else DEACTIVATE;
        }
    } else { /* ph==2 */
        if (IS_VALID) {
            val = MSKBITN(simdinstr.arg1);
            if (TSTBIT(cp->marks, val)) { /* no ! */
                mask = val;
                val = MSKBITN(cp->portno);
                for (r=0; r<PORTCNT; r++)
                    if (TSTBIT(cp->ports[r]->portval, val)) {
                        CLRBIT(cp->marks, mask); /* versus SETBIT */
                        break;
                    }
            }
        }
    }
    break;
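The pointer-mark instructions above share a two-phase pattern: in phase 1 each sender raises the wire that belongs to its target cell, and in phase 2 each cell inspects only its own wire. The sketch below illustrates that pattern on a single bus word. It is an illustration under assumptions, not simulator code: struct cell, phase1_drive and phase2_setmark are hypothetical names, and one 32-bit word stands in for a port's portval.

```c
#include <stdint.h>

#define NCELLS 4   /* hypothetical number of cells sharing one port bus */

/* Hypothetical, reduced cell: just a port number and a mark register. */
struct cell { unsigned portno; uint32_t marks; };

/* Phase 1: every sender raises the wire owned by its target cell.
   targets[i] is the port number of the cell that sender i points at. */
static uint32_t phase1_drive(const unsigned *targets, int nsenders)
{
    uint32_t bus = 0;                       /* the bus must start cleared */
    for (int i = 0; i < nsenders; i++)
        bus |= (uint32_t)1 << targets[i];
    return bus;
}

/* Phase 2: each cell looks only at its own wire and, if it is raised,
   sets the requested mark bit in its own mark register. */
static void phase2_setmark(struct cell *cells, int n, uint32_t bus, unsigned mark)
{
    for (int i = 0; i < n; i++)
        if (bus & ((uint32_t)1 << cells[i].portno))
            cells[i].marks |= (uint32_t)1 << mark;
}
```

Because the wires are OR-ed together, several senders pointing at the same cell have the same effect as one, which is why no arbitration is needed for setting or clearing marks.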

The  following  group  consists  of  port  arbitration  instructions,  and  fetch/store  variants. 

/* Arbitrate on specified port. */
case ARBPORT: /* dir */
    if (IS_ACTIVE) {
        dir = simdinstr.arg1;
        if (ph==1) {
            if (!VALIDDIR(dir)) {
                CLRBIT(cp->state, STATE_TRYING); /*newer*/
                DEACTIVATE;
            } else
                /* This requires that the ports be cleared to start with */
                SETBITN(cp->ports[dir]->portval, cp->portno);
        } else { /* ph==2 */
            /* lower numbered bits have higher priority */
            if ((cp->ports[dir]->portval) & LOWERBITS(cp->portno)) DEACTIVATE;
            else
                CLRBIT(cp->state, STATE_TRYING); /*newer*/
        }
    }
    break;

/* Arbitrate on port given by pointer. */
case ARBPORTPTR: /* reg */
    if (IS_ACTIVE && (r = cp->reg[simdinstr.arg1], VALIDADDR(r))) {
        tile = TILE(r);
        dir = WHICHDIR(TILE(cp->self), tile);
        if (ph==1) {
            if (dir == UNDEF) {
                if (simdinstr.arg1==LEFT) SETBIT(cp->state, STATE_REMOTE_LEFT); else
                if (simdinstr.arg1==RIGHT) SETBIT(cp->state, STATE_REMOTE_RIGHT);
                CLRBIT(cp->state, STATE_TRYING); /*newer*/
                DEACTIVATE;
            } else
                SETBITN(cp->ports[dir]->portval, cp->portno);
        } else { /* ph==2 */
            if ((cp->ports[dir]->portval) & LOWERBITS(cp->portno)) DEACTIVATE;
            else
                CLRBIT(cp->state, STATE_TRYING); /*newer*/
        }
    }
    break;
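The arbitration rule used by ARBPORT and ARBPORTPTR ("lower numbered bits have higher priority") can be sketched on a single request word. This is an illustration only: lowerbits is an assumed reading of the LOWERBITS macro (a mask of all bit positions below n), and arbitrate is a hypothetical helper that plays the role of every contending cell in turn.

```c
#include <stdint.h>

/* Assumed meaning of LOWERBITS(n): a mask of all bit positions below n. */
static uint32_t lowerbits(unsigned n)
{
    return ((uint32_t)1 << n) - 1u;
}

/* requests has bit p set when the cell with port number p is contending.
   Returns the port number that stays active, or -1 if nobody contended. */
static int arbitrate(uint32_t requests)
{
    for (unsigned p = 0; p < 32; p++) {
        if (!(requests & ((uint32_t)1 << p)))
            continue;                      /* this cell did not contend */
        if (requests & lowerbits(p))
            continue;                      /* a lower-numbered cell wins: DEACTIVATE */
        return (int)p;                     /* no lower bit set: this cell wins */
    }
    return -1;
}
```

Each cell can make this decision locally from the shared port value and its own port number, so exactly one contender per port remains active after phase 2.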

/* Assumes arbitration. */
/* This should be split into two instructions. */
case STORE: /* target-ptr, treg, sreg */
    /* originally written using action between ph==1 and ph==2 */
    if (ph==1) {
        if (IS_ACTIVE && (r = cp->reg[simdinstr.arg1], VALIDADDR(r))) {
            tile = TILE(r);
            dir = WHICHDIR(TILE(cp->self), tile);
            if (dir == UNDEF) DEACTIVATE;
            else {
                cp->ports[dir]->portaddr = r;
                cp->ports[dir]->portval = cp->reg[simdinstr.arg3];
            }
        }
    } else { /* ph==2 */
        /* Note: the following is done by all VALID cells */
        if (IS_VALID) {
            val = cp->self;
            /* Check only one? */
            for (r=0; r<PORTCNT; r++)
                if (cp->ports[r]->portaddr == val)
                    cp->reg[simdinstr.arg2] = cp->ports[r]->portval;
        }
    }
    break;

/* Assumes arbitration. */
/* The sequence MAKEREQ FETCHDATA performs a fetch. */
case MAKEREQ: /* ptr */
    if (ph==1) {
        cp->aux = 0; /* Note: done by all */
        if (IS_ACTIVE && (r = cp->reg[simdinstr.arg1], VALIDADDR(r))) {
            tile = TILE(r);
            dir = WHICHDIR(TILE(cp->self), tile);
            if (dir == UNDEF) {
                if (simdinstr.arg1==LEFT) SETBIT(cp->state, STATE_REMOTE_LEFT); else
                if (simdinstr.arg1==RIGHT) SETBIT(cp->state, STATE_REMOTE_RIGHT);
                DEACTIVATE;
            } else
                cp->ports[dir]->portaddr = r; /* Could xor */
        }
    } else { /* ph==2 */
        /* Note: the following is done by all cells */
        val = cp->self;
        /* Check only one? */
        for (r=0; r<PORTCNT; r++)
            if (cp->ports[r]->portaddr == val)
                SETBITN(cp->aux, r);
    }
    break;

case FETCHDATA: /* treg, source-ptr, sreg */
    if (ph==1) {
        /* Note: the following is done by all cells */
        if (IS_VALID) {
            val = cp->aux;
            if (val != 0)
                for (r=0; r<PORTCNT; r++)
                    if (TSTBITN(val, r)) {
                        cp->ports[r]->portaddr = cp->self;
                        cp->ports[r]->portval = cp->reg[simdinstr.arg3];
                    }
        }
    } else { /* ph==2 */
        if (IS_ACTIVE && (r = cp->reg[simdinstr.arg2], VALIDADDR(r))) {
            tile = TILE(r);
            dir = WHICHDIR(TILE(cp->self), tile);
            if (dir == UNDEF) DEACTIVATE;
            else
            if (cp->ports[dir]->portaddr == r)
                cp->reg[simdinstr.arg1] = cp->ports[dir]->portval;
            else {
                printid(cp);
                printf(" fetchdata: bad data portaddr = ");
                printaddr(cp->ports[dir]->portaddr);
                printf(" portval = %d\n", cp->ports[dir]->portval);
            }
        }
    }
    break;

/* No arbitration needed, activate by PORTNO. */
case STOREPN: /* portno, target-ptr, treg, sreg */
    /* originally written using action between ph==1 and ph==2 */
    if (ph==1) {
        if (IS_ACTIVE &&
            (simdinstr.arg1 == cp->portno) &&
            (r = cp->reg[simdinstr.arg2], VALIDADDR(r))) {
            tile = TILE(r);
            dir = WHICHDIR(TILE(cp->self), tile);
            if (dir == UNDEF) {
                if (simdinstr.arg2==LEFT) SETBIT(cp->state, STATE_REMOTE_LEFT); else
                if (simdinstr.arg2==RIGHT) SETBIT(cp->state, STATE_REMOTE_RIGHT);
                DEACTIVATE;
            } else {
                cp->ports[dir]->portaddr = r;
                cp->ports[dir]->portval = cp->reg[simdinstr.arg4];
            }
        }
    } else { /* ph==2 */
        /* Note: the following is done by all cells */
        if (IS_VALID) {
            val = cp->self;
            /* Check only one? */
            for (r=0; r<PORTCNT; r++)
                if (cp->ports[r]->portaddr == val)
                    cp->reg[simdinstr.arg3] = cp->ports[r]->portval;
        }
    }
    break;

/* No arbitration needed, activate by PORTNO. */
case FETCHPN: /* portno, treg, source-ptr, sreg */
    if (ph==1) {
        if (IS_VALID) {
            if (simdinstr.arg1 == cp->portno) {
                val = cp->reg[simdinstr.arg4];
                for (r=0; r<PORTCNT; r++) {
                    cp->ports[r]->portaddr = cp->self;
                    cp->ports[r]->portval = val;
                }
            }
        }
    } else { /* ph==2 */
        if (IS_ACTIVE && (val = cp->reg[simdinstr.arg3], VALIDADDR(val))) {
            for (r=0; r<PORTCNT; r++)
                if (cp->ports[r]->portaddr == val) {
                    cp->reg[simdinstr.arg2] = cp->ports[r]->portval;
                    break;
                }
        }
    }
    break;

This initiates global feedback. If anything is active, then the global feedback bit should eventually be set.

case SENDGLOBAL:
    if (IS_ACTIVE && ph==2) {
        globalfeedback++;
    }
    break;

These are the row/col instructions with associated global actions. The normal sequence is ARBTILE
ARBCOL SELROW ARBROW SELCELL, and then perform global operations.

case ARBTILE:
    if (IS_ACTIVE) {
        if (ph==1) {
            SETBITN(cp->ports[NORTH]->portval, cp->portno);
        } else { /* ph==2 */
            /* lower numbered bits have higher priority */
            if ((cp->ports[NORTH]->portval) & LOWERBITS(cp->portno)) DEACTIVATE;
        }
    }
    break;

case ARBCOL:
    if (IS_ACTIVE) {
        if (ph==1) {
            SETBITN(cp->col->colval, ROW(cp->self));
        } else { /* ph==2 */
            if ((cp->col->colval) & LOWERBITS(ROW(cp->self))) DEACTIVATE;
            else cp->col->colowner = cp->self;
        }
    }
    break;

case SELROW: /* target */
    if (ph==2 && IS_ACTIVE) {
        if (r = cp->reg[simdinstr.arg1], VALIDADDR(r)) {
            if (cp->col->colowner == cp->self)
                cp->col->colval = ROW(r);
            else {
                printid(cp);
                printf(" selrow bad active cell\n");
                DEACTIVATE;
            }
        } else DEACTIVATE;
    }
    break;

case ARBROW: /* target */
    if (ph==2 && IS_ACTIVE) {
        if (r = cp->reg[simdinstr.arg1], VALIDADDR(r)) {
            if (cp->col->colowner == cp->self) {
                if (TSTBIT(cp->col->colval, FAILFLAG))
                    DEACTIVATE;
            } else {
                printid(cp);
                printf(" arbrow bad active cell\n");
                DEACTIVATE;
            }
        } else DEACTIVATE;
    }
    break;

case SELCELL: /* ptr */
    if (IS_ACTIVE && ph==2) {
        if (r = cp->reg[simdinstr.arg1], VALIDADDR(r)) {
            if (cp->col->colowner != cp->self ||
                row[ROW(r)].rowowner != cp->self) {
                printid(cp);
                printf(" selcell: not owner\n");
                printcell(cp);
                printaddr(cp->col->colowner); printf("\n");
                printaddr(row[ROW(r)].rowowner); printf("\n");
                printcelladdr(row[ROW(r)].rowowner);
            } else {
                CELLAT(r).aux = cp->self;
            }
        } else DEACTIVATE;
    }
    break;

case FETCHGLBL: /* target source-ptr reg */
    if (IS_ACTIVE && (r = cp->reg[simdinstr.arg2], VALIDADDR(r))) {
        if (cp->col->colowner != cp->self ||
            row[ROW(r)].rowowner != cp->self ||
            CELLAT(r).aux != cp->self) {
            if (ph==1) {
                printid(cp);
                printf(" fetchglbl: not owner\n"); }
        } else {
            if (ph==2) {
                /* no cell arbitration is necessary */
                /* only really use the number of the cell from the pointer */
                cp->reg[simdinstr.arg1] = CELLAT(r).reg[simdinstr.arg3];
            }
        }
    }
    break;

case STOREGLBL: /* target-ptr reg source */
    /* as usual very similar to the above */
    if (IS_ACTIVE && (r = cp->reg[simdinstr.arg1], VALIDADDR(r))) {
        if (cp->col->colowner != cp->self ||
            row[ROW(r)].rowowner != cp->self ||
            CELLAT(r).aux != cp->self) {
            if (ph==1) {
                printid(cp);
                printf(" storeglbl: not owner\n"); }
        } else {
            if (ph==2) {
                /* no cell arbitration is necessary */
                /* if (verbose) {
                   printf("storeglbl: ");
                   printaddr(r);
                   printf(" %d ", simdinstr.arg2);
                   printf(" <- %d\n", cp->reg[simdinstr.arg3]); } */
                CELLAT(r).reg[simdinstr.arg2] = cp->reg[simdinstr.arg3];
            }
        }
    }
    break;

case TSTMARKGLBL: /* source-ptr mark (changed order) */
    /* as usual very similar to the above */
    if (IS_ACTIVE && (r = cp->reg[simdinstr.arg1], VALIDADDR(r))) {
        if (cp->col->colowner != cp->self ||
            row[ROW(r)].rowowner != cp->self ||
            CELLAT(r).aux != cp->self) {
            if (ph==1) {
                printid(cp);
                printf(" tstmarkglbl: not owner\n"); }
        } else {
            if (ph==2) {
                /* no cell arbitration is necessary */
                if (!TSTBITN(CELLAT(r).marks, simdinstr.arg2))
                    DEACTIVATE;
            }
        }
    }
    break;

case TSTNOTMARKGLBL: /* source-ptr mark */
    /* as usual very similar to the above */
    if (IS_ACTIVE && (r = cp->reg[simdinstr.arg1], VALIDADDR(r))) {
        if (cp->col->colowner != cp->self ||
            row[ROW(r)].rowowner != cp->self ||
            CELLAT(r).aux != cp->self) {
            if (ph==1) {
                printid(cp);
                printf(" tstnotmarkglbl: not owner\n"); }
        } else {
            if (ph==2) {
                /* no cell arbitration is necessary */
                if (TSTBITN(CELLAT(r).marks, simdinstr.arg2))
                    DEACTIVATE;
            }
        }
    }
    break;

case SETMARKGLBL: /* target-ptr mark */
    /* as usual very similar to the above */
    if (IS_ACTIVE && (r = cp->reg[simdinstr.arg1], VALIDADDR(r))) {
        if (cp->col->colowner != cp->self ||
            row[ROW(r)].rowowner != cp->self ||
            CELLAT(r).aux != cp->self) {
            if (ph==1) {
                printid(cp);
                printf(" setmarkglbl: not owner "); }
        } else {
            if (ph==2) {
                /* no cell arbitration is necessary */
                SETBITN(CELLAT(r).marks, simdinstr.arg2);
            }
        }
    }
    break;

case CLRMARKGLBL: /* target-ptr mark */
    /* as usual very similar to the above */
    if (IS_ACTIVE && (r = cp->reg[simdinstr.arg1], VALIDADDR(r))) {
        if (cp->col->colowner != cp->self ||
            row[ROW(r)].rowowner != cp->self ||
            CELLAT(r).aux != cp->self) {
            if (ph==1) {
                printid(cp);
                printf(" clrmarkglbl: not owner "); }
        } else {
            if (ph==2) {
                /* no cell arbitration is necessary */
                CLRBITN(CELLAT(r).marks, simdinstr.arg2);
            }
        }
    }
    break;

case INCRGLBL: /* val, target-ptr */
    if (IS_ACTIVE && (r = cp->reg[simdinstr.arg2], VALIDADDR(r))) {
        if (cp->col->colowner != cp->self ||
            row[ROW(r)].rowowner != cp->self) {
            if (ph==1) {
                printid(cp);
                printf(" incrglbl: not owner\n"); }
        } else {
            if (ph==2) {
                /* no cell arbitration is necessary */
                CELLAT(r).refc += simdinstr.arg1;
            }
        }
    }
    break;

This group allows general access to the bits of the state register, but is probably not how this state testing
should be implemented.

/* Note: the following work with bit masks */
case TSTSTATE: /* flags */
    if (ph==2 && IS_ACTIVE && !TSTBIT(cp->state, simdinstr.arg1))
        DEACTIVATE;
    break;

case TSTNOTSTATE: /* flags */
    if (ph==2 && IS_ACTIVE && TSTBIT(cp->state, simdinstr.arg1))
        DEACTIVATE;
    break;

case SETSTATE: /* flags */
    if (ph==2 && IS_ACTIVE)
        SETBIT(cp->state, simdinstr.arg1);
    break;

case CLRSTATE: /* flags */
    if (ph==2 && IS_ACTIVE)
        CLRBIT(cp->state, simdinstr.arg1);
    break;

Here  is  another  group  of  relatively  simple  instructions  which  are  tests  on  addresses/pointers  and  the 
reference  count  register. 

case TSTLOCAL: /* ptr */
    if (ph==2 && IS_ACTIVE) {
        if (!((r = cp->reg[simdinstr.arg1], VALIDADDR(r)) &&
              UNDEF != WHICHDIR(TILE(cp->self), TILE(r))))
            /* deactivate if not a valid address or not local */
            DEACTIVATE;
    }
    break;

case TSTREMOTE: /* ptr */
    if (ph==2 && IS_ACTIVE) {
        if (!((r = cp->reg[simdinstr.arg1], VALIDADDR(r)) &&
              UNDEF == WHICHDIR(TILE(cp->self), TILE(r))))
            /* deactivate if not a valid address or local */
            DEACTIVATE;
    }
    break;

case TSTADDR: /* ptr */
    if (ph==2 && IS_ACTIVE && !VALIDADDR(cp->reg[simdinstr.arg1]))
        /* deactivate if not a valid address */
        DEACTIVATE;
    break;

case TSTNOTADDR: /* ptr */
    if (ph==2 && IS_ACTIVE && VALIDADDR(cp->reg[simdinstr.arg1]))
        /* deactivate if actually is a valid address */
        DEACTIVATE;
    break;

case TSTUNUSED:
    if (ph==2 && IS_ACTIVE && 0 != cp->refc)
        DEACTIVATE;
    break;

case TSTUNSHARED:
    if (ph==2 && IS_ACTIVE && 1 != cp->refc)
        DEACTIVATE;
    break;

case TSTINUSE:
    if (ph==2 && IS_ACTIVE && cp->refc <= 0)
        DEACTIVATE;
    break;

These were designed for use with explicit port arbitration (ARBPORT, ARBPORTPTR), allowing a relatively
tight "repeat until all done" loop.


case SETTRYING:
    if (ph==2) {
        if (IS_ACTIVE) {
            SETBIT(cp->state, STATE_TRYING);
            globalfeedback++;
        }
        else CLRBIT(cp->state, STATE_TRYING); /* When clear? */
    }
    break;

case INITTRYING:
    if (ph==2) {
        if (TSTALLBIT(st, STATE_VALID|STATE_TRYING)) {
            ACTIVATE;
            globalfeedback++;
        } else {
            DEACTIVATE;
        }
    }
    break;

case CLRTRYING:
    if (ph==2 && IS_ACTIVE) {
        CLRBIT(cp->state, STATE_TRYING);
    }
    break;

These are reference count operations (there is also a global version).

case INCR: /* val */
    if (IS_ACTIVE && ph==2) cp->refc += simdinstr.arg1;
    break;

case INCRPTR1: /* val, target-ptr */
case INCRPTR2:
    if (ph==1) {
        if (IS_ACTIVE &&
            (val = NUM(cp->self), (op==INCRPTR1 ? 0<=val && val<=1 :
                                                   2<=val && val<=3)) &&
            (r = cp->reg[simdinstr.arg2], VALIDADDR(r))) {
            dir = WHICHDIR(TILE(cp->self), TILE(r));
            if (dir == UNDEF) {
                if (simdinstr.arg1==LEFT) SETBIT(cp->state, STATE_REMOTE_LEFT); else
                if (simdinstr.arg1==RIGHT) SETBIT(cp->state, STATE_REMOTE_RIGHT);
                DEACTIVATE;
            } else {
                SETBIT(cp->ports[dir]->portval,
                       ((val==0||val==2) ? MSKBITN(PORTNO(r)) :
                                           MSKBITN(PORTNO(r)+PORTNOSHIFT)));
            }
        }
    } else { /* ph==2 */
        /* IS_VALID ?? */
        val = MSKBITN(cp->portno) | MSKBITN(cp->portno+PORTNOSHIFT);
        for (r=0; r<PORTCNT; r++)
            if (TSTBIT(cp->ports[r]->portval, val)) {
                if (TSTBIT(cp->ports[r]->portval, MSKBITN(cp->portno)))
                    cp->refc += simdinstr.arg1;
                if (TSTBIT(cp->ports[r]->portval, MSKBITN(cp->portno+PORTNOSHIFT)))
                    cp->refc += simdinstr.arg1;
            }
    }
    break;

These are the allocation instructions. The normal idiom is ALLOCREQ AVAIL1 AVAIL2 ALLOCPTR. The
AVAIL steps leave a pointer to a free cell in the specified register (or 0, the null pointer, on failure), but the
cell is still not allocated; ALLOCPTR is used to actually allocate the cell. Many of the details could be changed
and improved here.

case ALLOCREQ:
    if (ph==1) {
        cp->aux = 0;
        if (IS_ACTIVE) {
            val = MSKBITN(cp->portno);
            for (r=0; r<PORTCNT; r++)
                SETBIT(cp->ports[r]->portval, val);
        }
    } else { /* ph==2 */
        if (!TSTBIT(cp->state, STATE_VALID)) {
            val = 0;
            mask = PORTNOCASE(cp->portno);
            for (r=0; r<PORTCNT; r++)
                val = (val << CELLSPERTILE) |
                      PORTNOSEL(mask, cp->ports[r]->portval);
            cp->aux = val; /* Could use ACC instead */
        }
    }
    break;

case AVAIL1: /* reg */
case AVAIL2:
    if (ph==1) {
        if (cp->aux &&
            (val = NUM(cp->self), (op==AVAIL1 ? 0<=val && val<=1 :
                                                2<=val && val<=3))) {
            mask = val;
            val = LOWESTBIT(cp->aux);
            for (r=PORTCNT-1; 0<=r; r--) {
                if (val & LOWERBITS(CELLSPERTILE)) {
                    val = val & LOWERBITS(CELLSPERTILE);
                    val = PORTNOREPLY(cp->portno, val);
                    cp->ports[r]->portval |=
                        ((mask==0 || mask==2) ? val : val << PORTNOSHIFT);
                    break;
                }
                val >>= CELLSPERTILE;
            }
            cp->aux = 0;
        }
    } else { /* ph==2 */
        if (IS_ACTIVE) {
            if (op==AVAIL1) cp->reg[simdinstr.arg1] = 0; /*??*/
            val = MSKBITN(cp->portno) | MSKBITN(cp->portno+PORTNOSHIFT);
            if (!(op==AVAIL2 && cp->reg[simdinstr.arg1]!=0)) {
                for (r=0; r<PORTCNT; r++)
                    if (TSTBIT(cp->ports[r]->portval, val)) {
                        if (TSTBIT(cp->ports[r]->portval,
                                   MSKBITN(cp->portno))) {
                            if (op==AVAIL1) a = 0; else a = 2;
                        }
                        else
                        if (TSTBIT(cp->ports[r]->portval,
                                   MSKBITN(cp->portno+PORTNOSHIFT))) {
                            if (op==AVAIL1) a = 1; else a = 3;
                        }
                        val = cp->self;
                        switch (r) {
                        case EAST: val = ADDR(ROW(val), COL(val)+1, a); break;
                        case NORTH: val = ADDR(ROW(val)-1, COL(val), a); break;
                        case WEST: val = ADDR(ROW(val), COL(val)-1, a); break;
                        case SOUTH: val = ADDR(ROW(val)+1, COL(val), a); break;
                        }
                        cp->reg[simdinstr.arg1] = val;
                        break;
                    }
            }
            if (op==AVAIL2 && cp->reg[simdinstr.arg1] == 0) DEACTIVATE;
        }
    }
    break;

case ALLOCPTR: /* target-ptr */
    if (ph==1) {
        if (IS_ACTIVE)
            if ((r = cp->reg[simdinstr.arg1], VALIDADDR(r))) {
                dir = WHICHDIR(TILE(cp->self), TILE(r));
                SETBIT(cp->ports[dir]->portval, MSKBITN(PORTNO(r)));
            } else DEACTIVATE;
    } else { /* ph==2 */
        /* Note: the following is done by all cells */
        if (!TSTBIT(cp->state, STATE_VALID)) {
            cp->refc = 0;
            val = MSKBITN(cp->portno);
            for (r=0; r<PORTCNT; r++)
                if (TSTBIT(cp->ports[r]->portval, val)) {
                    SETBIT(cp->state, STATE_VALID);
                    cp->refc++; /* should really just do once */
                }
        }
    }
    break;

These  are  the  commit  instructions. 

case COMMIT:
    /* failure flag? */
    if (ph==2 && IS_ACTIVE) {
        for (r=0; r<REGNUM; r++) {
            val = cp->reg[r];
            cp->reg[r] = cp->reg[r+REGNUM]; /*??*/
            cp->reg[r+REGNUM] = val;
        }
        /* This should result in eventual deletion of next stuff */
        SETBIT(cp->state, STATE_COMMIT);
    }
    break;

case COMMITMARK: /* mark */
    if (ph==2 && IS_ACTIVE) {
        if (TSTBITN(cp->marks, simdinstr.arg1)) {
            for (r=0; r<REGNUM; r++) {
                val = cp->reg[r];
                cp->reg[r] = cp->reg[r+REGNUM]; /*??*/
                cp->reg[r+REGNUM] = val;
            }
        }
        SETBIT(cp->state, STATE_COMMIT);
    }
    break;

case SWAP:
    if (ph==2 && IS_ACTIVE) {
        for (r=0; r<REGNUM; r++) {
            val = cp->reg[r];
            cp->reg[r] = cp->reg[r+REGNUM]; /*??*/
            cp->reg[r+REGNUM] = val;
        }
    }
    break;

case SETCOMMIT:
    if (ph==2 && IS_ACTIVE) {
        SETBIT(cp->state, STATE_COMMIT);
    }
    break;
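COMMIT, COMMITMARK and SWAP all exchange the first REGNUM registers with the next REGNUM registers. The sketch below isolates that exchange; it is illustrative only, with swap_banks as a hypothetical helper name and a bank size of 4 chosen arbitrarily (the report does not give the value of REGNUM in this section).

```c
#define REGNUM 4   /* hypothetical bank size, chosen for the example */

/* Exchange the bank reg[0..REGNUM-1] with the bank
   reg[REGNUM..2*REGNUM-1], one register at a time. */
static void swap_banks(int reg[2 * REGNUM])
{
    for (int r = 0; r < REGNUM; r++) {
        int tmp = reg[r];
        reg[r] = reg[r + REGNUM];
        reg[r + REGNUM] = tmp;
    }
}
```

Applying the exchange twice restores the original contents, so a SWAP followed by a SWAP is a no-op on the registers.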

Miscellaneous other instructions:

/* In real ensemble, expand to code sequence. */
case DELETE:
    if (ph==2 && IS_ACTIVE) {
        cp->marks = 0;
        cp->refc = 0;
        cp->state = 0;
        cp->aux = 0;
        for (r=0; r<REGCNT; r++) cp->reg[r] = 0;
    }
    break;

/* Initiate relocation if non-local. */
case RELOCREQ: /* ptr */
    if (ph==2 && IS_ACTIVE) {
        if ((simdinstr.arg1 == LEFT || simdinstr.arg1 == RIGHT) &&
            (r = cp->reg[simdinstr.arg1], VALIDADDR(r)) &&
            UNDEF == WHICHDIR(TILE(cp->self), TILE(r))) {
            if (simdinstr.arg1 == LEFT) SETBIT(cp->state, STATE_REMOTE_LEFT); else
                SETBIT(cp->state, STATE_REMOTE_RIGHT);
            DEACTIVATE;
        }
    }
    break;

case SETRANDOM: /* reg */
    if (ph==2 && IS_ACTIVE) {
        cp->reg[simdinstr.arg1] = randoom(); /* own random function */
    }
    break;

/* May not be usable because of some IS_VALID tests. */
case INITALL:
    if (ph==2) ACTIVATE;
    break;

case TSTBLACK:
    if (IS_ACTIVE && ph==2)
        if (PORTNOCASE(cp->portno))
            DEACTIVATE;
    break;

case TSTRED:
    if (IS_ACTIVE && ph==2)
        if (!PORTNOCASE(cp->portno))
            DEACTIVATE;
    break;

case TSTCLASS: /* class: one of 6 cases */
    if (IS_ACTIVE && ph==2) {
        val = cp->self;
        if (!(simdinstr.arg1 == ((COL(val)+3*ROW(val))%5)))
            DEACTIVATE;
    }
    break;
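The class expression used by TSTCLASS, (COL + 3*ROW) % 5, takes the values 0 through 4. One property worth noting, offered here as an observation rather than something the report states, is that a tile and its four neighbours (east, west, north, south) always fall into five different classes. The sketch below checks this for non-negative coordinates; tile_class and neighbourhood_distinct are hypothetical helper names.

```c
/* Class of the tile at (row, col), using the TSTCLASS expression.
   Coordinates are assumed to be non-negative. */
static int tile_class(int row, int col)
{
    return (col + 3 * row) % 5;
}

/* Returns 1 if the tile at (row, col) and its four neighbours all have
   distinct classes, 0 otherwise. Requires row >= 1 and col >= 1 so that
   every neighbour also has non-negative coordinates. */
static int neighbourhood_distinct(int row, int col)
{
    int seen[5] = {0, 0, 0, 0, 0};
    const int dr[5] = {0, 0, 0, -1, 1};
    const int dc[5] = {0, 1, -1, 0, 0};
    for (int i = 0; i < 5; i++) {
        int k = tile_class(row + dr[i], col + dc[i]);
        if (seen[k])
            return 0;
        seen[k] = 1;
    }
    return 1;
}
```

A consequence is that, when only tiles of a single class are active, no two active tiles are adjacent.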

case TSTCELLNUM: /* cellnum */
    if (IS_ACTIVE && ph==2) {
        val = cp->self;
        if (!(simdinstr.arg1 == NUM(val)))
            DEACTIVATE;
    }
    break;

case CELLARBPTR: /* target-ptr */
    if (IS_ACTIVE) {
        if (ph==1) {
            if (r = cp->reg[simdinstr.arg1], VALIDADDR(r)) {
                dir = WHICHDIR(TILE(cp->self), TILE(r));
                SETBIT(cp->ports[dir]->portval, MSKBITN(PORTNO(r)));
            }
        } else { /* ph==2 */
            if (r = cp->reg[simdinstr.arg1], VALIDADDR(r)) {
                dir = WHICHDIR(TILE(cp->self), TILE(r));
                if (!TSTBIT(cp->ports[dir]->portval, MSKBITN(PORTNO(r))))
                    DEACTIVATE;
            }
        }
    }
    break;

The action for an illegal op code is:
default:
    printf("Illegal op code %d\n", op);
    fflush(stdout);
    exit(-1);
}

}

The following routines are used with the SIMD instruction SENDGLOBAL to simulate the global feedback
mechanism.

/* special simd operations */

resetglob()
{
    globalfeedback = 0;
}

int
testglob()
{
    return globalfeedback;
}
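
The two routines above support a "repeat until all done" control loop: the controller clears the feedback counter, issues a step, and stops once no cell reports activity. The following is a minimal sketch of such a loop under stated assumptions. The step function and the pending array are hypothetical stand-ins for a SIMD instruction sequence ending in SENDGLOBAL; only resetglob, testglob and globalfeedback mirror names from the listing.

```c
static int globalfeedback;

static void resetglob(void) { globalfeedback = 0; }
static int  testglob(void)  { return globalfeedback; }

/* Hypothetical stand-in for one SIMD step: every cell that still has
   work pending does one unit of it and raises global feedback, as an
   active cell executing SENDGLOBAL would. */
static void step(int *pending, int n)
{
    for (int i = 0; i < n; i++)
        if (pending[i] > 0) {
            pending[i]--;
            globalfeedback++;
        }
}

/* Repeat until no cell reports activity. Returns the number of rounds
   in which at least one cell was active. */
static int run_until_quiet(int *pending, int n)
{
    int rounds = 0;
    for (;;) {
        resetglob();
        step(pending, n);
        if (!testglob())
            break;
        rounds++;
    }
    return rounds;
}
```

The loop always performs one final round in which nothing is active; that extra round is how the controller learns that the computation has finished.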